ABSTRACT


U.S.  Marine  Corps  logisticians  and  operational  planners  must  simultaneously  plan  for  the 
sustainment  of  current  operations  while  planning  for  future  operations.  Currently,  this 
process  is  hindered  by  the  manual  correlation  of  force  consumption  data  from  electronic 
and  hardcopy  documents. 

To refine this process, this thesis presents a method for converting, analyzing, and
storing these documents in an electronic format. To aid in the conversion process, three
optical character recognition (OCR) applications are compared: an open-source, freely
available online application; Microsoft OneNote®; and Nuance OmniPage®. Two data
extraction programs were created and compared to assess
the  feasibility  of  automating  the  analysis  phase.  The  first  program  concentrated  on 
automated  analysis  with  user  review  at  the  end.  The  second  program  concentrated  on 
continual  user  interaction  throughout  the  entire  process. 

The  results  of  these  comparisons  advocate  the  use  of  professional-grade  OCR 
software  such  as  OmniPage®  to  create  a  standard  file  that  can  be  accepted  as  an  input  by 
a  data  extraction  program.  Based  on  the  consumption  documents  reviewed  by  this  thesis, 
a  manual  data  extraction  program  is  advised  to  create  a  universal  output  format  for  later 
use  in  an  appropriate  data  storage  method. 
I. INTRODUCTION


A.  MOTIVATION 

The  United  States  Marine  Corps,  established  in  November  1775,  has  participated 
in  countless  operations  around  the  world.  Tasked  with  a  multitude  of  operations  ranging 
from  major  armed  conflicts  to  humanitarian  aid  missions,  U.S.  Marine  Corps  logisticians 
and operational planners must continually plan, maintain, and execute acquisition and
distribution programs to support approximately 174,000 troops [1]. Regardless of the
scale of an operation or mission, they must always answer the
fundamental  logistical  question  of  “how  much  equipment  and  supplies  do  we  need  for  the 
amount  of  personnel  assigned  to  the  mission?” 

In  order  to  answer  this  question,  they  must  locate  the  correct  reference  documents 
that  may  exist  in  electronic  and  hardcopy  formats,  retrieve  the  correct  consumption  data 
that  often  resides  in  “usage  tables,”  and  analyze  these  inputs,  providing  useful  planning 
data  for  utilization.  This  cyclic  process  is  depicted  in  Figure  1.  For  the  purpose  of  this 
thesis  and  from  a  logistical  standpoint,  “consumption  data”  is  an  all-encompassing  term 
that  describes  the  raw  data  contained  in  the  logistical  planning  factor  input  documents. 
When  a  specific  example  of  consumption  data  is  illustrated,  it  will  be  presented  as  a 
“consumption  data  element.”  The  term  “usage  table”  is  defined  in  the  next  section. 


Figure  1.  Consumption  Data  Correlation  Process 
1. The Current Methodology of Manual Correlation

Under the current methodology, the location, retrieval, and analysis of consumption data
are conducted manually by civilian and military personnel in the logistics or operations
field from various locations throughout the world.

a.  Location 

The  first  challenge  that  must  be  overcome  is  the  collection  of  the  correct  reference 
documents.  While  some  may  exist  in  an  electronic  format  such  as  a  portable  document 
format  (PDF),  others  exist  as  hardcopy  books,  field  manuals,  and  orders.  Thus,  the 
planner  must  have  access  to  both  electronic  and  hardcopy  resources.  Depending  on  their 
situation,  this  may  not  be  possible. 

b.  Retrieval 

The  next  challenge  is  locating  from  these  documents  the  correct  consumption  data 
that  often  reside  in  tables.  These  tables,  referred  to  as  “usage  tables,”  represent  the 
standard  display  format  of  consumption  data  elements.  Typically,  each  consumption  data 
element  is  listed  in  a  table  with  its  corresponding  consumption  rate.  Thus,  a  usage  table  is 
a  collection  of  consumption  data  elements  and  their  corresponding  rate  of  consumption  or 
allowance.  In  order  to  understand  the  end-user’s  use  of  this  tabular  data,  an  illustration  of 
one  usage  table  and  its  properties  is  given  as  an  example. 

(1)  Usage  Table.  Figure  2  illustrates  one  instance  of  a  usage  table.  Each  line, 
composed  of  four  properties  (columns),  represents  one  single  consumption  data  element. 
Note  that  for  the  example,  some  columns  are  not  included  for  legibility. 


[Figure 2: usage table excerpt showing GCE daily rates for squad demolition set items
such as "CHARGE, DEMO BLOCK 1 LB TNT" and "CAP, BLASTING ELECTRIC"]

Figure  2.  Infantry-Heavy  Threat  Combat  Planning  Factors  Table  (from  [2]) 
This table consists of the following columns:

•  Weapon.  Consists  of  two  fields:  Weapon  ID  and  Nomenclature.  These 
two  fields  are  used  to  uniquely  identify  the  weapon.  An  expected  input  for 
this  field  is  alphanumeric  characters. 

•  Ammunition. Consists of two fields: the Department of Defense
Identification Code (DODIC) and Nomenclature. These two fields are used to
uniquely  identify  the  ammunition  being  used  by  the  weapon  identified  in 
the  “Weapon”  column.  An  expected  input  for  this  field  is  alphanumeric 
characters. 

•  GCE  Rates.  This  column  is  used  to  define  the  consumption  rate  of  the 
Ground  Combat  Element  (GCE)  component  of  the  Marine  Air-Ground 
Task  Force  (MAGTF).  The  GCE  is  the  primary  attack  element  of  a 
MAGTF  and  is  expected  to  have  a  higher  rate  of  consumption.  An 
expected  input  for  this  field  is  an  integer  or  floating  point  (decimal) 
number. For elements intended for GCE use only, the “OTHER-THAN
GCE RATES” column may be empty. This column is broken down further
into  three  sub-columns: 

•  Daily  Assault.  This  rate  is  shown  as  the  number  of  rounds  per  day 
per  weapon  or  individual  in  the  GCE  during  the  assault  (intense) 
phase  of  combat  [2]. 

•  Daily  Sustain.  This  rate  is  shown  as  the  number  of  rounds  per  day 
per  weapon  or  individual  in  the  GCE  during  the  sustainment  phase 
of  combat  [2]. 

•  Basic  Allowance.  This  rate  indicates  the  basic  allowance  (BA)  of 
the  ammunition  item  recommended  to  be  carried  within  the  means 
normally  available  to  the  Fleet  Marine  Force  (FMF)  unit 
embarking  and  debarking  for  combat  operations  [2]. 

•  Other  than  GCE  Rates.  This  column  is  used  to  define  the  consumption 
rate  of  the  Command  Element  (CE),  Aviation  Combat  Element  (ACE), 
and Combat Service Support Element (CSSE) of the MAGTF. Overall,
this column has the same characteristics as the “GCE RATES” column:
daily assault, daily sustain, basic allowance, and the use of integer or
floating  point  numbers.  For  items  intended  solely  for  the  CE,  ACE,  or 
CSSE,  the  “GCE  RATES”  column  may  be  empty. 
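
As an illustration of the column structure described above, a single consumption data
element could be represented as follows. This is a minimal sketch rather than the format
used by any Marine Corps system: the class names and field names are assumptions
introduced here, and the values are placeholders that do not come from any reference
document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rates:
    """One rate group (GCE or other-than-GCE); None marks an empty cell."""
    daily_assault: Optional[float] = None
    daily_sustain: Optional[float] = None
    basic_allowance: Optional[float] = None

    def is_empty(self) -> bool:
        # A group is empty when none of its three sub-columns holds a value.
        return (
            self.daily_assault is None
            and self.daily_sustain is None
            and self.basic_allowance is None
        )


@dataclass
class ConsumptionDataElement:
    """One line of a usage table (hypothetical field names)."""
    weapon_id: str
    weapon_nomenclature: str
    dodic: str
    ammunition_nomenclature: str
    gce: Rates
    other_than_gce: Rates


# Placeholder values for illustration only.
element = ConsumptionDataElement(
    weapon_id="W000",
    weapon_nomenclature="EXAMPLE WEAPON",
    dodic="X000",
    ammunition_nomenclature="EXAMPLE AMMUNITION",
    gce=Rates(daily_assault=16.0, daily_sustain=2.5, basic_allowance=160),
    other_than_gce=Rates(),  # left empty: item intended for GCE use only
)
```

Modeling an empty rate group explicitly, rather than storing zeros, keeps the distinction
between "no rate published" and "a published rate of zero."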

While  not  every  usage  table  published  by  the  United  States  Marine  Corps  may  be 
an exact replica (data and content) of the one shown in Figure 2, it is reasonably expected
that each of them will follow a similar table layout or uniquely structured format. Figure
3,  for  example,  represents  the  same  information  from  Figure  2  as  a  table  from  a  different 
reference  document. 
Figure 3. Class V(W) FY-06 Planning Factors Consumption Data (from [3])


c.  Analyze 

Having  found  the  correct  reference  documents  and  usage  tables,  the  next 
challenge faced by the planner is to analyze the usage tables, extracting specific
consumption data elements as necessary. In order to plan for an operation to field
100  Marines  for  example,  they  may  need  to  gather  20  consumption  data  elements  from  an 
electronic  usage  table  and  100  consumption  data  elements  from  a  different  hardcopy 
usage  table.  In  order  to  keep  track  of  these  elements  for  planning  an  operation,  the  most 
realistic  approach  is  to  record  them  into  a  single  location  in  a  universal  format.  This  is 
done  primarily  via  electronic  means — Microsoft  Word®,  Microsoft  Excel®,  text  files, 
etc.  Thus,  even  during  the  analyzing  phase,  they  must  juggle  between  electronic  and 
hardcopy  formats.  To  complicate  matters,  the  logistician  may  be  stateside  or  deployed, 
may  or  may  not  have  access  to  the  Internet,  may  or  may  not  have  hardcopy  usage  tables 
for  ready  reference,  and  may  not  have  an  extensive  planning  shop  at  his/her  disposal. 
Also,  the  user  must  have  some  familiarity  with  usage  tables  and  a  working  knowledge  of 
which  usage  tables  contain  specific  data  elements.  Since  hardcopy  usage  tables  do  not 
have  search  functionality,  inexperienced  planning  personnel  may  spend  countless  hours 
reading  through  a  usage  table  document  only  to  find  that  they  had  the  wrong  document.  A 
clear benefit of having electronic usage tables, in the form of a PDF for example, is that
the user gains the ability to search through the document using partial or full keywords.


d.  Utilize 

After  the  correct  information  has  been  located  and  compiled,  military  planners 
and  logisticians  provide  that  data  to  their  military  commander  for  planning  purposes  or 
use  the  data  to  accomplish  their  main  task — equipping  and  maintaining  troops.  Thus, 
depending on the final document, different variations for displaying the data may be
used: databases, spreadsheets, pie charts, etc. While different storage methods are
discussed  for  storing  the  output  of  the  extraction  programs,  this  stage  of  the  process  is 
outside  the  focus  of  this  thesis. 

2.  Present-day  Solutions 

While  no  systems  have  been  created  to  address  this  specific  conundrum,  several 
systems  have  been  created  to  aid  in  the  planning  process.  Systems  such  as  the  Joint 
Operations  Planning  and  Execution  System  and  The  Marine  Air  Ground  Task  Force  War 
Planning  System  have  been  used  as  resources  [3].  Some  end-users  have  taken  a  proactive 
approach,  creating  stand-alone  systems.  Using  programs  such  as  Microsoft  Excel®  and 
other  user-created  applications,  these  users  have  attempted  to  provide  temporary 
solutions  to  the  current  problem  as  depicted  in  Figures  4  and  5. 
Figure 4. Fuel Planning Worksheet in Microsoft Excel® (from [3])

Figure 5. Spreadsheet utilized by II Marine Expeditionary Force (from [3])

Although  these  user-created  applications  are  made  with  good  intentions,  most 
suffer  from  the  same  deficiencies: 

•  Not  accepted  as  legitimate  applications  by  the  U.S.  Marine  Corps. 

•  May  contain  volatile  or  malicious  code  susceptible  to  attack. 

•  Not widely distributable or maintainable.

•  May  contain  invalid  and/or  obsolete  information. 

While  these  solutions  are  innovative  and  have  some  merit,  a  more  stable  solution 
that  can  be  accredited,  upgraded,  and  distributed  to  all  users  in  the  Marine  Corps  in  a 
variety  of  environments  is  necessary. 

Another  simple  and  inexpensive  solution  for  today’s  environment  is  the  use  of 
historical  information.  When  constrained  by  time  or  resources,  planners  may  rely  on  data 
from  previous  operations  or  exercises  to  plan  for  an  operation.  Since  many  operations 
may  be  similar  in  nature,  planning  for  a  new  operation  may  be  as  simple  as  changing  the 
total  troop  count  or  total  vehicle  count.  This  methodology  has  two  main  drawbacks:  a 
lack  of  flexibility  and  the  potential  to  be  trusted  as  the  definitive  data  without  further 
examination. As the global environment changes, new missions may be encountered that
lack historical examples. In addition, repetitive use of historical data can erode the core
competencies of planners and may provide estimates that are clearly inappropriate for the
given scenario based on the different factors surrounding the mission.
From a “cradle-to-grave” standpoint, the current methodology of locating,
retrieving, analyzing, and utilizing consumption data is a tedious and laborious task.
Depending on the type and number of operations being conducted, this compilation
process may become exponentially time-consuming and complicated. Further aggravated
by personnel cuts, this process places an undue burden on the planner.

B.  PURPOSE  OF  STUDY 

This  thesis  strives  to  reduce  the  burden  placed  on  the  planner  by  answering  three 
main  research  questions: 

•  “What  are  the  abilities  and  limitations  of  current  OCR  technologies?” 

•  “What  is  the  best  method  for  analyzing  consumption  documents? 
Automated  analysis  with  review  at  the  end  or  walkthrough  analysis  with 
review  throughout  the  entire  process?” 

Figure  6  illustrates  how  answering  these  questions  relates  to  refining  the  current 
process. 


[Figure 6 labels: "Create Standard Inputs" and "Create Standard Outputs"]


Figure  6.  Consumption  Data  Correlation  Process  (refined) 


The first question addresses the problem of storing both electronic and hardcopy
documents. With consumption data appearing in both electronic and hardcopy
documents, planners must spend considerable time collecting and transcribing this data
into one central location. Not only is this time-consuming and labor-intensive, critical
consumption data may be omitted or incorrectly interpreted. While conducting OCR can
be labor-intensive at first glance, the use of an OCR application can reduce the
administrative burden placed on the planner by (a) allowing them to convert hardcopy
documents into electronic documents which can be searched and (b) allowing them to
convert pre-existing electronic documents that cannot be searched into searchable
documents. These searchable documents can also be used as inputs to data extraction
programs that have the ability to extract consumption data elements and provide them in
a useful output, for example, a (key, value) pair for a database or a simple text file
containing solely consumption data elements. In order to provide an accurate snapshot of
OCR, three off-the-shelf OCR applications were compared: an open-source, freely
available, online application; a licensed version of Microsoft OneNote®; and a licensed
version of Nuance OmniPage®. The accuracy rate, capabilities, and limitations for each
application were tested and recorded in Chapter IV. As a byproduct of this comparison, a
standard text file was created that was later used by the analysis programs. This text file
was reviewed and corrected until its contents were 100% accurate, creating “perfect
inputs” for the analysis programs.
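
One way to compare an OCR application's raw output against the corrected reference text
is a character-level accuracy score. The sketch below is only an illustration of the idea:
the metric and the function name are assumptions made here and should not be read as the
measure reported in Chapter IV. It uses Python's standard difflib module.

```python
from difflib import SequenceMatcher


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Return the fraction of reference characters that were matched in the OCR output."""
    if not reference:
        # An empty reference is matched perfectly only by an empty output.
        return 1.0 if not ocr_output else 0.0
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


# Hypothetical strings: a corrected value and a misrecognized version of it.
score = character_accuracy("16.00763", "16.0O7S3")
```

A score of 1.0 means every reference character was found, in order, in the OCR output;
lower scores indicate characters that were dropped or misrecognized.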

To answer the second question, two simple analysis programs were created and
compared. These programs used the text file outputs created by the OCR applications as
inputs. The goal of the first program was to maximize the work done by the program and
reduce the amount of user interaction necessary. While the program analyzes the input
files using predefined logic statements, there is still a requirement for the end-user to
review and verify the entries at the end. The goal of the second program was to compare
an alternative approach, requiring continual user interaction throughout the
decision-making and review process. This program was created in two versions: line-by-line and
page-by-page. The byproduct of this comparison was the creation of a standard text file
and a (key, value) pair in the form of a Python list data structure. These outputs represent
the last step of the study. Determining the best data storage and presentation approach for
these outputs is left for future research.
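
The difference between the two approaches can be sketched in a few lines of Python. This
is a simplified illustration, not the programs developed for this thesis: the line layout
(a four-character code, a nomenclature, and one decimal rate), the regular expression, and
the function names are all assumptions made for the example. Both functions return a
Python list of (key, value) pairs.

```python
import re

# Hypothetical line layout: code, nomenclature, then a single decimal rate.
LINE_PATTERN = re.compile(
    r"^(?P<dodic>[A-Z][0-9]{3})\s+(?P<name>.+?)\s+(?P<rate>\d+(?:\.\d+)?)$"
)


def parse_line(line):
    """Return a (key, value) pair for a well-formed line, otherwise None."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return (match["dodic"], (match["name"], float(match["rate"])))


def automated(lines):
    """Process every line first; return the pairs plus the lines left for final review."""
    pairs, needs_review = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        parsed = parse_line(line)
        if parsed is None:
            needs_review.append(line)
        else:
            pairs.append(parsed)
    return pairs, needs_review


def walkthrough(lines, review_line):
    """Consult the user on each line; review_line returns the pair to keep, or None."""
    pairs = []
    for line in lines:
        if not line.strip():
            continue
        decision = review_line(line, parse_line(line))
        if decision is not None:
            pairs.append(decision)
    return pairs


# Placeholder input: one clean line and one with letter-for-digit OCR errors.
sample = ["X001 EXAMPLE ITEM ONE 60.0", "", "X002 EXAMPLE ITEM TWO 6O.O3"]
auto_pairs, auto_review = automated(sample)
walk_pairs = walkthrough(sample, lambda line, parsed: parsed)
```

In the automated version the user sees only the lines that failed to parse, at the end; in
the walkthrough version the callback stands in for a prompt that lets the user accept,
correct, or reject each line as it is processed.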
C. PRO/CON ANALYSIS OF AUTOMATION


On  one  hand,  digitizing  and  storing  consumption  data  has  the  following  benefits: 

•  Reduction  in  manpower  hours  necessary  to  correlate  data. 

•  Allows  access  to  consumption  data  from  anywhere  in  the  world. 

•  The  ability  to  provide  consumption  data  in  one  location. 

•  Eliminates the need to keep documents in hardcopy form, providing cost
savings and reducing environmental impact.

•  The  ability  to  present  the  consumption  data  in  various  electronic 
formats — database,  charts,  tables,  text  documents,  etc. 

On  the  other  hand,  there  are  drawbacks: 

•  Higher cost incurred in database / application administration (personnel
and equipment).

•  Requires  secondary  investment  in  security  and  network  monitoring 
personnel  and  systems. 

•  Depending  on  the  extensiveness  of  the  system,  it  may  require  dedicated 
personnel  to  operate  and  maintain  the  system. 

•  Relying  solely  on  the  software  may,  as  in  the  case  of  using  historical  data, 
erode  the  core  competency  of  the  end  user. 

•  Data  corruption  and/or  data  loss  could  result  in  a  lack  of  availability  for 
the  system  or  data. 

•  Should  this  process  be  completely  digitized  and  the  printing  of  hardcopy 
consumption  documents  be  ceased,  an  end-user  without  access  to  the 
system  would  be  unable  to  accomplish  their  tasks. 

While these benefits and drawbacks may not be all-encompassing, the move toward a
computer-aided system is a natural progression and is detailed in the following
chapters.

D.  ORGANIZATION  OF  THESIS 

Chapter II presents a background study of two forms of OCR, free-form and
template-based, and discusses several options for storing the output from the
analysis programs. Chapter III discusses two approaches for analyzing electronic
consumption documents: automated and walkthrough. Chapter IV compares the OCR
applications mentioned in Chapter II and demonstrates the analysis programs presented in
Chapter III. Chapter V summarizes the results of Chapter IV, lists recommendations
reached as a result of this research, and identifies areas for continued research.
II. BACKGROUND STUDY


A.  WHAT  IS  OPTICAL  CHARACTER  RECOGNITION? 

This  field  of  study  was  summarized  in  1993  by  Line  Eikvil,  a  Norwegian  scientist 
who  specialized  in  pattern  recognition,  computer  vision,  and  text  mining  as  follows: 

Optical  Character  Recognition  deals  with  the  problem  of  recognizing 
optically  processed  characters.  Optical  recognition  is  performed  off-line 
after  the  writing  or  printing  has  been  completed  as  opposed  to  on-line 
recognition  where  the  computer  recognizes  the  characters  as  they  are 
drawn.  Both  hand  printed  and  printed  characters  may  be  recognized  but 
the  performance  is  directly  dependent  upon  the  quality  of  the  input 
documents.  [6] 

Although  there  has  been  refinement  and  improvement  in  OCR  technology,  this 
summary  still  represents  the  fundamental  principles  behind  the  process.  We  are  primarily 
concerned with (a) performing optical recognition and (b) ensuring high quality input
documents are supplied to the process. The latter is a universally expected norm: in
order for any application to provide the best outputs, it must be given the best
inputs. As a means to an end, consumption data must exist on a computer and be
recognized  by  the  analysis  program.  In  order  to  do  this  for  consumption  documents,  OCR 
was  leveraged  to  handle  two  cases: 

•  Consumption  data  contained  in  hardcopy  documents.  Here,  the 
document  exists  solely  in  hardcopy  format  and  OCR  must  be  conducted 
on  the  document  to  create  an  electronic  version. 

•  Consumption  data  contained  in  electronic  documents  but  not 
recognized  by  applications  as  a  machine-readable  format.  Here,  the 
document  exists  in  an  electronic  version  but  the  document  (or  parts  of  it) 
may be presented as data that an application cannot interpret. For example,
the Python programming language is unable to natively interpret
Microsoft Word .docx files or data contained in PDF files.

1.  OCR  Software  Approaches 

In  order  to  best  understand  the  nature  of  OCR,  we  conducted  OCR  on  a  usage 
table  and  present  the  results  as  an  example  in  Figure  7.  Note:  this  data  was  originally 
presented  in  a  vertical  landscape  view.  For  the  purpose  of  the  example,  it  was  manually 
converted to a horizontal profile view. Chapter IV addresses the ability and accuracy of
OCR applications to conduct this procedure automatically.

Figure 7. Usage Table: Pre-OCR (from [2])


Using  an  OCR  application,  this  table  was  processed  and  a  Microsoft  Word® 
document  was  created.  Figure  8  represents  this  document  with  errors  highlighted  in 
yellow. 
[Figure 8: OCR output table; each row lists a weapon ID, the weapon nomenclature
(SQUAD DEMOLITION SET), a DODIC, the ammunition nomenclature, and six rate values]

Figure 8. Usage Table: Post-OCR (after [2])


In order to create the output in Figure 8, the OCR application went line-by-line
through the input document, recognizing, interpreting, and transcribing characters as it
went along. The output illustrates that the majority of the data was transcribed correctly.
Most of the errors that occurred were related to the numerical values associated with the
DODIC and the various rate values on the right-hand side. In particular, longer numerical
values that contained decimals had the highest error rate. Of note, the OCR application
was intelligent enough to place the output data into a table. This kind of intelligence
helps to refine and preserve the data for later analysis. Although the application
generated quite a few errors, this may not be an issue for a user who is using OCR as a
simple text conversion tool. However, this error rate is cause for concern in
applications that must maintain accurate information. Before discussing the manner in
which this error rate can be reduced, it is important to understand that two forms of OCR
exist: free-form and template-based.
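
Because the errors concentrate in numerical fields, a simple post-OCR check can flag rate
values that are not valid numbers. The following is a minimal sketch under stated
assumptions: the list of letter-for-digit confusions is partial and chosen for
illustration, the function name is hypothetical, and any suggested correction would still
need to be verified by the user against the source document.

```python
import re

# A rate is expected to be an integer or a decimal number.
RATE_PATTERN = re.compile(r"\d+(?:\.\d+)?")

# Assumed, partial list of letter-for-digit confusions seen in OCR output.
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5"})


def check_rate(token):
    """Return (is_valid, suggestion) for one OCR'd rate token.

    suggestion is the token itself when valid, a candidate correction when a
    simple character substitution yields a number, and None otherwise.
    """
    if RATE_PATTERN.fullmatch(token):
        return True, token
    candidate = token.translate(CONFUSIONS)
    if RATE_PATTERN.fullmatch(candidate):
        return False, candidate
    return False, None


# Hypothetical tokens of the kind an OCR application might emit.
results = [check_rate(t) for t in ["60.00000", "16.0O7S3", "ISO", "600.0000()"]]
```

A check of this kind cannot detect a digit misread as another digit; it only catches
tokens that are no longer numeric at all.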

a.  Free-form  OCR 

The output produced in Figure 8 represents the use of free-form OCR. This is the
most common and default method for most OCR applications and the method used by
this thesis. While it allows for maximum flexibility of input formats such as text, tables,
images, and alphanumeric characters, it is “considered slow and inaccurate at times...
however, using free-form will still significantly reduce the amount of errors due to
mis-keying during manual data entry” [7]. Although free-form was used in this thesis, should
consumption data present itself in a predictable fashion in the future, another form of
OCR, template-based OCR, may provide a higher degree of accuracy and throughput.

b.  Template-based  OCR 

Many institutions and corporations throughout the world use template-based OCR
to conduct data entry for a variety of systems. One such data entry method has gained
popularity in the last decade: mobile banking deposit. Many major banking institutions
allow members to directly deposit paychecks to their accounts using mobile-banking
applications. The only requirement is for the end user to have an end-device capable of
running the application and creating or importing an image (photograph) of the check to
be processed. Again, this method is best understood through an example.
Figure 9 is a blank check for illustrative and discussion purposes.
Figure 9. Blank Check for Template-based OCR Discussion

In computing, a template is defined as “a computer document that has the basic
format of something (a business letter, chart, graph, etc.) that can be used many different
times” [8]. A check follows this definition, having features that are predefined and used
many different times in almost all other checks. With regard to understanding
template-based OCR of a check, a check has the following distinguishable features:

•  Rectangular in shape, having four corners at 90-degree angles.

•  Standard  fields  to  denote  data  fields  (e.g.,  DATE,  PAY  TO  THE  ORDER 
OF,  $,  DOLLARS,  and  FOR). 

•  Magnetic ink character recognition (MICR) font information at the bottom
of the check, e.g., bank routing number, bank account number, and
check number. The MICR font is a standard of the American National
Standards Institute (ANSI) and was specifically created for recognition on
checks.

While financial institutions withhold their proprietary software procedures and
capabilities, with an understanding of how template-based OCR operates, their processes
can be demystified to provide an understanding of how this technology can be used to
interpret consumption documents. First, the check to be processed must be filled out with
all the necessary fields completed. Note: some companies have software capable of
detecting when fields are not completed and providing error responses. Second, the check
must be entered into the system by either taking a picture of it or scanning it using an
input device. Some applications may direct the input of the check from within the
application (e.g., clicking deposit check prompts the user to take a picture of the front and
the back of the check). Once the check has been placed in an electronic format,
depending on the sophistication of the application, it may be processed at the end-device
or sent to a data processing system to verify the accuracy of the check. This is the phase
where template-based OCR processing occurs. During this processing, the following
questions are answered:

•  Is the document a check? This can be done by verifying the shape and
length of the document as well as verifying the 90-degree angles are
present. Some companies may even have the ability to detect when the
corner of a check is torn; however, such information is simply not known
due to the proprietary nature of the software. If the document is not a
check, an error should be returned.

•  Are  all  the  fields  filled  out?  This  is  done  by  observing  the  marks  that 
occur  within  the  standard  data  fields.  For  example,  a  valid  date  should  be 
after  the  DATE  label  and  a  valid  numerical  value  should  be  after  the  dollar 
sign  ($).  If  “$  ILUVCOOKIES”  was  written  on  the  check,  the  software 
should  be  intelligent  enough  to  understand  that  ILUVCOOKIES  is  not  a 
valid  numerical  value  and  return  an  error. 

•  Is  the  accounting  information  at  the  bottom  valid?  This  is  done  by 
interpreting  the  numerical  values  in  MICR  font  at  the  bottom.  Comparison 
of  the  name  on  the  check  to  the  account  holder  information  would  be  a 
likely  check  for  authenticity.  If  the  accounting  information  is  not  correct, 
an  error  should  be  returned. 

Finally,  based  on  the  results  of  the  processing,  the  user  should  be  given  an 
acceptance  or  rejection  status  of  the  overall  transaction.  Figure  10  illustrates  this  process 
from  beginning  to  end. 
Figure 10. Template-based OCR Process of a Check


Throughout the process, the accuracy of the input (the check) is verified against a
template (features of a check). Should the check deviate from the accepted template of a
check or contain erroneous attributes, the software should return an error. The
important takeaway in this scenario is the need for template compliance, accuracy, and
readability. In order for template-based OCR to interpret consumption data, the data
should present itself repetitively and in the form of a template such as a table. Based on
the consumption documents reviewed by this thesis, consumption data is a conglomerate
of free text, tables, and lists, which favors the use of free-form OCR.

2.  Increasing  OCR  Accuracy 

Having discussed the various forms of OCR, we can return to the
discussion of improving input accuracy with an appreciation for its importance. The
choice between free-form and template-based OCR depends on the input supplied to the
OCR application. The effectiveness and accuracy of the application, independent of the
OCR format chosen, depend on the clarity and readability of the given input. Computer
science often uses the acronym GIGO: garbage in, garbage out.
Should the application for analyzing and interpreting consumption data receive bad
inputs, it will most likely produce bad outputs. Thus, prior to converting consumption
data using OCR, the highest emphasis must be placed upon ensuring the “cleanliness” of
the input documents. In order to do so, the following steps should be taken:

(1) Paper-to-electronic conversion. Should a consumption document exist
solely in hardcopy form, the best known copy should be used for OCR conversion. This
document should have minimal interference and degradation. For example, if pages are
torn, they should be replaced. Smudges, ink blots, and extra markings inside the
document should be removed. OCR may attempt to recognize these markings and
produce erroneous results. Note: photocopied documents tend to lose quality through
blurring and fuzzing and should be used as a last resort.

(2) Electronic-to-electronic conversion. Should a consumption document
exist solely in electronic form, the best known copy should be used as the input
document. Again, the document should be inspected with the same care as in the
paper-to-electronic conversion. Should the electronic document be of poor quality, another
document should be used or the current one rewritten.

(3) Document formatting. For optimal processing, consumption data should
be displayed in the appropriate format (text, table, columns, rows, etc.). While OCR will
attempt to identify and render this data intelligently, giving it the desired input for an
expected result is recommended. For example, if the OCR application is expected to
create a table of data, provide it with a table of data. When possible, all pages should be
presented in the same page layout. For example, all pages in the document should be
presented in either portrait or landscape format, not a mixture of the two.

Although some of these steps may be manpower intensive up front, the dividends
they pay in the long run may outweigh the costs incurred with auditing the OCR output.
Reformatting and retyping a document may take several hours or even days. Likewise,
the same amount of time may be spent auditing and correcting the OCR output if the
OCR software misinterprets inputs and provides unreadable and/or gibberish outputs. For
example, the software may present 100 lines of data in portrait view when the original
input was 3 lines of data presented in landscape view. Thus, it is incumbent upon the
person who collects the electronic and hardcopy consumption documents for OCR to
decide whether or not the document is in an acceptable state.
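
The page-layout check in step (3) lends itself to a small automated test before a document is submitted for OCR. The following is a minimal sketch, assuming that the width and height of each page are already available (for example, from the scanning software); the function names are hypothetical.

```python
def page_orientation(width, height):
    """Classify a page as landscape when it is wider than it is tall."""
    return "landscape" if width > height else "portrait"


def has_mixed_orientation(page_sizes):
    """Return True when a document mixes portrait and landscape pages.

    page_sizes is a list of (width, height) tuples, one per page.
    """
    orientations = {page_orientation(w, h) for w, h in page_sizes}
    return len(orientations) > 1
```

A document flagged by has_mixed_orientation would be a candidate for reformatting before conversion.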

3.  OCR  Summary 

OCR  comes  in  two  forms:  free-form  and  template-based.  Due  to  the  nature  of  the 
consumption  documents  reviewed  by  this  thesis,  free-form  OCR  is  leveraged.  Emphasis 
is  placed  upon  refining  the  input  documents  prior  to  OCR.  In  order  to  leverage  OCR  to  its 
maximum  effectiveness,  it  should  be  presented  with  the  best  inputs.  The  same  mentality 
can  be  applied  to  the  process  of  making  wine:  poor-quality  grapes  can  seldom  be 
mitigated  by  the  winemaker. 

B.  METHODS  FOR  STORING  CONSUMPTION  DATA 

The analysis programs compared in Chapter IV produce two outputs: a standard
text file and a set of (key, value) pairs. To illustrate the rationale behind this
decision, the following data storage methods are discussed: traditional databases and a
document repository.

1.  Traditional  Database 

A  database  can  be  leveraged  to  store  consumption  data.  Two  types  of  databases 
currently  exist:  relational  and  non-relational. 

a.  Relational  Database  Model 

Inside of a relational database, data is represented in a schema (a framework),
consisting of tables that have interconnecting relationships. Data may be spread across as
few as one table or as many as thousands in order to correctly represent entities and
relationships between the data elements. Each element (entity) in a database table has a
unique identifier, referred to as the primary key, which prevents duplicate information
from existing. Multiple entities are mapped together using relationship tables. In order to
make this data available to the end-user, a database server is created and hosted, allowing
users to query the database. In order to query the database server, a user will typically
build queries using structured query language (SQL) constructs. Figure 11 illustrates how
one consumption data element may be represented in a relational database. Figure 12 is
an illustration of the query submitted and the query’s result based on the data shown in
Figure 11.


Weapon Table (Entity)

Weapon ID  Nomenclature
B0471      Squad Demolition Set

Ammunition Table (Entity)

DODIC  Nomenclature
M032   Charge, Demo Block 1 LB TNT

GCE Rate Table (Entity)

DODIC  Daily ASSAULT  Daily SUSTAIN  Basic Allowance
M032   15.00753       3.39605        48

Other than GCE Rates Table (Entity)

DODIC  Daily ASSAULT  Daily SUSTAIN  Basic Allowance
M032   15.00753       3.39605        15

Part of Table (Relationship mapping between Weapon and Ammunition)

Weapon ID  DODIC
M032       M032

Rates for Table (Relationship mapping between Ammunition and Rates)

DODIC  Daily     Daily    Basic      Daily     Daily    Basic
       ASSAULT   SUSTAIN  Allowance  ASSAULT   SUSTAIN  Allowance
M032   15.00753  3.39605  48         15.00753  3.39605  15

Figure 11. Consumption Data Element in Relational Form

select *
from Weapon, Ammunition, GCE Rate, Other than GCE Rates, Part of, Rates for
where Weapon.WeaponID = “B0471” and Ammunition.DODIC = “M032”

Weapon ID  Nomenclature          DODIC  Nomenclature
B0471      SQUAD DEMOLITION SET  M032   CHARGE, DEMO BLOCK 1 LB TNT
B0471      SQUAD DEMOLITION SET  M130   CAP, BLASTING ELECTRIC

Daily     Daily    Basic      Daily     Daily    Basic
Assault   Sustain  Allowance  Assault   Sustain  Allowance
15.00753  3.39605  48         15.00753  3.39605  15
18.01945  4.00000  150        18.01945  2.03084  150

Figure 12. Relational Database Query and Result


Figure 11 illustrates the main drawback of implementing a relational database.
Although  only  one  consumption  data  element  was  given,  six  tables  had  to  be  created  in 
order  to  correctly  represent  all  of  its  data  elements.  While  this  may  seem  insignificant  at 
first  glance,  this  problem  becomes  more  pronounced  when  data  is  spread  across  hundreds 
or  thousands  of  tables.  Here,  a  cost  in  computing  time  is  incurred  to  search  through  each 
table,  establish  relationships,  and  present  subsequent  tables.  Additionally,  any  inputs  into 
the  system  must  strictly  adhere  to  the  existing  format,  or  “schema.”  Any  inputs  that 
deviate  from  the  appropriate  input  format  will  be  rejected  or  cause  an  error. 
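
To make the relational layout concrete, part of it can be sketched with Python's built-in sqlite3 module. This is a minimal sketch, not the database used in this thesis: SQLite is chosen only because it ships with Python, only four of the six tables are created, and the table and column names (written without spaces) are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Weapon     (WeaponID TEXT PRIMARY KEY, Nomenclature TEXT);
    CREATE TABLE Ammunition (DODIC TEXT PRIMARY KEY, Nomenclature TEXT);
    CREATE TABLE GCERate    (DODIC TEXT PRIMARY KEY, DailyAssault REAL,
                             DailySustain REAL, BasicAllowance INTEGER);
    CREATE TABLE PartOf     (WeaponID TEXT, DODIC TEXT,
                             PRIMARY KEY (WeaponID, DODIC));
""")
conn.execute("INSERT INTO Weapon VALUES (?, ?)", ("B0471", "Squad Demolition Set"))
conn.execute("INSERT INTO Ammunition VALUES (?, ?)", ("M032", "Charge, Demo Block 1 LB TNT"))
conn.execute("INSERT INTO GCERate VALUES (?, ?, ?, ?)", ("M032", 15.00753, 3.39605, 48))
conn.execute("INSERT INTO PartOf VALUES (?, ?)", ("B0471", "M032"))

# Joining through the relationship table reassembles one data element.
rows = conn.execute("""
    SELECT w.WeaponID, w.Nomenclature, a.DODIC, a.Nomenclature,
           r.DailyAssault, r.DailySustain, r.BasicAllowance
    FROM Weapon w
    JOIN PartOf p     ON p.WeaponID = w.WeaponID
    JOIN Ammunition a ON a.DODIC = p.DODIC
    JOIN GCERate r    ON r.DODIC = a.DODIC
    WHERE w.WeaponID = ?
""", ("B0471",)).fetchall()

# The primary key rejects a second row with the same identifier.
try:
    conn.execute("INSERT INTO Weapon VALUES (?, ?)", ("B0471", "Duplicate"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Even in this reduced form, retrieving a single element requires three joins, and an input that violates the schema (here, a duplicate primary key) is rejected with an error.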

b.  Non-relational  Database  Model 

The  main  goal  of  using  NoSQL  (“Not  Only  SQL”)  is  to  break  away  from  the 
problems  associated  with  maintaining  relationships  in  relational  databases.  Since  a 
relational  database  must  keep  track  of  the  relationships  contained  within  the  database,  it 
must  create  extra  tables  in  order  to  do  so.  This  problem  is  avoided  with  NoSQL 
implementations, which offer a variety of ways to store data. One of the common
approaches is the use of a (key, value) pair to represent a single unique data element. This
(key, value) approach illustrates the use of a fundamental computer science data
structure: the dictionary. The National Institute of Standards and Technology (NIST)
defines a dictionary as “an abstract data type storing items, or values. A value is accessed
by an associated key. Basic operations for manipulating a dictionary are new, insert, find
and delete” [5]. Figure 13 illustrates how one consumption data element may be
represented in dictionary format. Figure 14 is an illustration of the query submitted in
Java and the result which would be shown using the data shown in Figure 13.


Weapon Dictionary

Weapon ID (key)  Consumption Information (value)
B0471            SQUAD DEMO SET, M032, CHARGE, DEMO BLOCK 1 LB TNT,
                 15.00753, 3.39605, 48, 15.00753, 3.39605, 15, M130, CAP,
                 BLASTING ELECTRIC, 18.01945, 4.00000, 150, 18.01945,
                 2.03084, 150, ...

Figure 13. NoSQL (key, value) Dictionary Example

Program Code

import java.util.HashMap;

public class DictionaryExample {  // class wrapper added so the listing compiles
    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<>();

        // This is where values are entered into the dictionary.
        map.put("B0471", "SQUAD DEMO SET, M032, CHARGE, DEMO BLOCK 1 LB TNT, "
                + "15.00753, 3.39605, 48, 15.00753, 3.39605, 15, M130, CAP, "
                + "BLASTING ELECTRIC, 18.01945, 4.00000, 150, 18.01945, 2.03084, 150, ...");

        // This is where values are retrieved.
        String queryResult = map.get("B0471");

        // ... parsing and string manipulation code not shown for simplicity. In a
        // real-world application, parsing logic would require CPU resources of the
        // end device requesting information from the server.
    }
}

Program Output

// While one form of potential output is shown, there are unlimited possibilities of
// representing this data to the end user.

Search results for “B0471” are:

WeaponID: B0471
Nomenclature: SQUAD DEMO SET
Ammunition:
    Subcomponent 1:
        DODIC: M032
        Nomenclature: CHARGE, DEMO BLOCK 1 LB TNT
        GCE Daily Assault: 15.00753
        GCE Daily Sustain: 3.39605
        GCE Basic Allowance: 48
        Other than GCE Daily Assault: 15.00753
        Other than GCE Daily Sustain: 3.39605
        Other than GCE Basic Allowance: 15
    Subcomponent 2:
    Subcomponent 3:

Figure 14. Java-implemented Dictionary and Query: Result
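
The same (key, value) idea can be sketched in Python, the language used for the analysis programs later in this thesis. This is a hedged illustration rather than the thesis's output format: the nested layout and the field names (nomenclature, ammunition, dodic, gce, other_than_gce) are assumptions, chosen to show how a structured value avoids the string parsing omitted from Figure 14 and how it maps directly onto JSON.

```python
import json

# Dictionary keyed by weapon ID; the nested field names are illustrative.
consumption = {
    "B0471": {
        "nomenclature": "SQUAD DEMO SET",
        "ammunition": [
            {
                "dodic": "M032",
                "nomenclature": "CHARGE, DEMO BLOCK 1 LB TNT",
                "gce": {"daily_assault": 15.00753, "daily_sustain": 3.39605,
                        "basic_allowance": 48},
                "other_than_gce": {"daily_assault": 15.00753, "daily_sustain": 3.39605,
                                   "basic_allowance": 15},
            },
            {
                "dodic": "M130",
                "nomenclature": "CAP, BLASTING ELECTRIC",
                "gce": {"daily_assault": 18.01945, "daily_sustain": 4.0,
                        "basic_allowance": 150},
                "other_than_gce": {"daily_assault": 18.01945, "daily_sustain": 2.03084,
                                   "basic_allowance": 150},
            },
        ],
    }
}


def lookup(weapon_id):
    """Return the stored value for a weapon ID, or None when the key is absent."""
    return consumption.get(weapon_id)


# The dictionary serializes to JSON text and back without loss.
as_json = json.dumps(consumption, indent=2)
restored = json.loads(as_json)
```

A single key lookup returns the whole element; no joins are required, at the cost of the database no longer enforcing relationships between weapons and ammunition.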


While a (key, value) implementation is discussed and used as the output format of
the analysis program in this thesis, JavaScript Object Notation (JSON) and Extensible
Markup Language (XML) formats also exist.

c. Database Architecture

Figure  15  illustrates  how  a  database  can  be  implemented.  Using  the  illustration  as 
a  guide,  we  further  discuss  each  component  of  the  database  infrastructure  in  detail. 


[Figure 15 diagram labels: Primary; Secondary (two instances); Secondary/Stand-Alone; MILCOM Satellite]

Figure 15. Global Database Architecture


The  following  components  are  necessary  to  implement  a  database: 

•  Primary Server: The primary server is responsible for providing each
secondary server the most valid and updated information.

•  Secondary Server(s): To distribute the processing load off the primary
server, secondary servers are created to handle transactions. Commonly,
these servers are implemented in various geographic locations not only to
distribute the load, but also to provide faster responses to the end user.

•  Secondary/Stand-Alone Server(s): Similar to the secondary server, a
stand-alone server could be implemented at a forward operating base (FOB) or
onboard a ship in an amphibious ready group (ARG). Figure 15 illustrates
the use of a server in an ARG environment. Here, updates can be sent back
and forth between the stand-alone database and the primary server when
connectivity is available over a military communications (MILCOM)
satellite or other means. During periods of non-connectivity, elements in
the database may become obsolete.

•  Intercommunication Capability/MILCOM Satellite: Connections between
the land-based primary and secondary servers could be made over a variety
of media (direct link, satellite, dedicated line, etc.). Communications
between a database at a FOB or an ARG require reach-back capability via
satellite communications using either a MILCOM satellite or commercial
provider, such as INMARSAT. Such communication is essential to
replicating changes from the primary database to each secondary
implementation.
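
Replication from the primary to a stand-alone server after a period of non-connectivity can be sketched as follows. This is a deliberately simplified sketch: it assumes each record carries a version counter, which is not described in this thesis, and it only copies newer records in one direction (primary to secondary). The function name replicate is hypothetical.

```python
def replicate(primary, secondary):
    """Copy newer records from primary to secondary; return the keys updated.

    Both arguments map a key to a (version, value) tuple. The version
    counter is an assumption made for this sketch.
    """
    updated = []
    for key, (version, value) in primary.items():
        # Copy the record when the secondary lacks it or holds an older version.
        if key not in secondary or secondary[key][0] < version:
            secondary[key] = (version, value)
            updated.append(key)
    return updated
```

Running replicate when connectivity returns brings obsolete elements on the stand-alone server up to date; running it again immediately afterwards copies nothing.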

Database  locations  require  additional  hardware  and  software: 

•  Storage devices: required for storing the database software and the data
contained in the database itself.

•  Network devices: provide connectivity in and out of the database.

•  Input devices: mouse, keyboard, scanners, etc.

•  Database software: necessary to run the database.

Figure 16 illustrates the necessary components for a database.


Figure 16. Database Location Hardware Instance

2. Document Repository


For  the  purpose  of  this  discussion,  we  reference  the  Department  of  Defense 
(DOD)  website  that  lists  all  valid  DOD  instructions.  Figure  17  displays  a  portion  of  this 
site for illustrative and discussion purposes, as shown in [4].


ISSUANCE NUMBER  ISSUANCE DATE  ISSUANCE SUBJECT                          CHANGE #
DoDI 1000.01     4/16/2012      IDENTIFICATION (ID) CARDS REQUIRED BY
                                THE GENEVA CONVENTIONS
DoDI 1000.04     9/13/2012      FEDERAL VOTING ASSISTANCE PROGRAM
                                (FVAP)
DoDI 1000.11     1/16/2009      FINANCIAL INSTITUTIONS ON DOD
                                INSTALLATIONS
DoDI 1000.13     1/23/2014      IDENTIFICATION (ID) CARDS FOR MEMBERS
                                OF THE UNIFORMED SERVICES, THEIR
                                DEPENDENTS, AND OTHER ELIGIBLE
                                INDIVIDUALS
DoDI 1000.15     10/24/2008     PROCEDURES AND SUPPORT FOR NON-FEDERAL
                                ENTITIES AUTHORIZED TO OPERATE ON DOD

Showing 1 to 714 of 714 entries

Figure 17. DOD Instructions in Circulation (from [4])


This website gives the appearance of a database. The information is displayed in a
table with columns. It has search and filter capability with full and partial-match
functionality. The website is accessed via a compatible web browser such as Microsoft
Internet Explorer, Mozilla Firefox, or Google Chrome. While this structure gives the
appearance of a traditional relational or non-relational database, it has been created using
JavaScript and HTML. When a hyperlink is clicked, the user is directed to the resource
that is located in the web server’s directory. Thus, no “query: result” is conducted. When
the user clicks a hyperlink, an HTTP GET request is sent to the server. The web server
then sends the material requested back to the user. In the case of this example, a GET
request returns a document that exists as a PDF in the server’s directory. This method
differs from a traditional database in the way data is stored. Unlike a database, an entire
file is stored in a directory that can be accessed. Users are able to search and retrieve by
document name but do not have the ability to search all the files at once.
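
The name-based search offered by such a repository can be sketched in a few lines of Python. This is a minimal sketch of the idea only, not the website's actual JavaScript; the function name search_by_name and the example file names are hypothetical.

```python
def search_by_name(filenames, term):
    """Return the file names that contain the search term, ignoring case."""
    term = term.lower()
    return [name for name in filenames if term in name.lower()]


# Hypothetical file names standing in for the documents in the server's directory.
files = ["DoDI_1000.01.pdf", "DoDI_1000.04.pdf", "DoDI_1000.13.pdf"]
partial_matches = search_by_name(files, "1000.0")
```

Because only the names are examined, a term that appears solely inside a document (for example, a word from its subject line) returns no match.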

3.  Data  Storage  Summary 

Traditional databases can be leveraged to store consumption data. Relational
databases can maintain relationships between entities (“Weapon and Ammunition,” for
example) but may become cumbersome with large quantities of data. Non-relational
databases circumvent this problem by relaxing input types and eliminating the need to
maintain relationships between data elements. However, non-relational databases are
less mature, and fewer people have the skills to maintain them. Alternative methods for
storing data exist. Storing consumption documents in text files on a web server is one such
alternative.

III. ANALYSIS PROGRAMS


A.  PROGRAM  CREATION 

Once  the  consumption  documents  have  been  converted  into  an  electronic  format 
and  collected  in  a  centralized  location  by  the  planner,  the  process  of  analyzing  the 
contents  of  these  documents  may  begin.  Currently,  no  programs  exist  that  have  the  sole 
purpose  of  extracting  consumption  data  elements.  This  is  an  important  statement  because 
it  illustrates  the  infancy  of  such  a  program.  Programs  are  often  created  by  understanding 
and  copying  the  core  functionality  and  limitations  of  another  similar  program.  Thus,  we 
must  look  at  this  program  from  the  ground  up. 

1.  Language  Selection 

Each and every program in existence is created out of lines of code. These lines of
code are written using a programming language. One such language, Python, is a widely
known and implemented language and will be used as the language of choice for this
thesis. It can be explained easily (compared to other languages) and avoids many
restrictions that other languages must enforce.

2.  Design  and  Functionality  Specification 

Figure 18 illustrates the flow of data in and out of the proposed extraction
programs, providing a framework for discussion.

Figure 18. Application Flow


a. Data Import (1.1): Open the Input File

The first step the program must take is the opening of an input file. A number of
input file types may exist depending on the OCR software used to create them. However,
these file types must be recognized by the program. To prevent any problems, the
programs presented in this thesis used a simple text file, specifically a file with a .txt
extension. Figure 19 illustrates a simple file-open routine that can be used by the program
to open a file and count the number of lines it has. This is an important step because it
ensures the program can conduct basic operations on the input file.

# Note: text following a pound sign (#) represents a
# comment and is not executable code.

# This program asks the end user to specify the name of an
# input document. Once the user has done so, the file is then
# opened and the number of lines is counted.

print("Please enter the file name of the input document: ")
filename = input()              # Here the user specifies the input name.
filename = filename + ".txt"    # We append the .txt file extension.
file = open(filename, "r")      # We open the file.
lineCount = 0
for line in file:
    lineCount = lineCount + 1   # And then count the number of lines.
print("Line count for " + filename + " is: " + str(lineCount))

--- OUTPUT ---

Please enter the file name of the input document:
consumptiondata
Line count for consumptiondata.txt is: 3

Figure 19. Open Input File Routine with Line Counting Logic

This simple program first prompts the user for an input file name,
“consumptiondata” in this example. Second, the program appends the .txt file extension.
This can be removed if the user is restricted to entering a filename ending with .txt.
Third, the file is opened in read-only mode specified by the “r” option. Other access
options exist, namely the write option, which will be discussed in the “Export to File”
section. Lastly, the program walks through the file and counts each line as it is
encountered. This number is then printed to the screen. In this example,
“consumptiondata.txt” contained three lines, which the program correctly interpreted.
Note that for simplicity, and in light of the OCR conversion to be discussed later, this
rudimentary program restricts the user to files with a “.txt” extension; however, a more
complex implementation might present the user with a “chooser” window (“modal box”)
that allows the user to select (“choose”) a file from a drop-down list of files contained
within a given directory (“folder”) stored within the host system.
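
A slightly more defensive variant of the routine in Figure 19 is sketched below. It is not one of the programs evaluated in this thesis; the function name count_lines and the decision to append “.txt” only when it is missing are assumptions made for the example.

```python
import os


def count_lines(name):
    """Return the number of lines in a text file, appending .txt when it is missing."""
    if not name.lower().endswith(".txt"):
        name = name + ".txt"
    # Fail with a clear message before attempting to open a missing file.
    if not os.path.isfile(name):
        raise FileNotFoundError("input document not found: " + name)
    # The with-statement closes the file automatically when the block ends.
    with open(name, "r") as handle:
        return sum(1 for _ in handle)
```

With this variant the user may type either “consumptiondata” or “consumptiondata.txt,” and a mistyped name produces an explicit error instead of an unexplained failure.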

b.  Read  in  Inputs  and  Store,  Close  Input  File  (1.2  and  1.3) 

Once the input file is open, data can be read out of the file and manipulated in the
program’s memory space. Once the contents of the file have been read as input, the file
can then be closed. Figure 20 illustrates a program that can handle these tasks.

# This program reads in the inputs from the open file
# and places them into a list for further processing.
# Once the contents have been read in, the input file is
# closed and the contents of the list are printed.

consumptionData = []    # Python list data structure
listSize = 0            # Counters for program termination
i = 0
for line in file:
    consumptionData.append(line)    # Place each line in the list
    listSize = listSize + 1         # Keep track of the number of elements
print("Number of elements in the list is: " + str(listSize) + "\n")
file.close()            # Close the input file, no longer needed
while (i < listSize):
    print(consumptionData[i])       # Print out each line of input from
    i = i + 1                       # the secondary data structure

--- OUTPUT ---

Number of elements in the list is: 3

B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1 LB TNT

B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC

B0471 SQUAD DEMOLITION SET M131 CAP, BLASTING, NON ELECTRIC

Figure 20. Read-in and Storage of Inputs

This program begins by creating a Python list data structure named
“consumptionData” with two program counters to control program execution and
termination. Once these have been initialized, each line of the input file,
“consumptiondata.txt,” is read into the list data structure. For tracking purposes, the size
of the list is incremented as each new element is placed in the list. This allows us to keep
track of the size of the list in the variable “listSize.” After all the data elements have been
read, the file is closed and all the elements are printed to the screen.
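
The read-in step can also be written more compactly. The following sketch is an alternative formulation, not the code of Figure 20; the function name read_lines is hypothetical. It removes the trailing newline from each line and relies on Python's built-in len() instead of a separate counter.

```python
def read_lines(path):
    """Read a text file into a list of lines without trailing newline characters."""
    # The with-statement closes the file as soon as the list has been built.
    with open(path, "r") as handle:
        return [line.rstrip("\n") for line in handle]
```

The number of elements is then available as len(read_lines(path)), so no listSize variable has to be maintained by hand.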

c. Analyze, Verify, and Correct Inputs (2.1-2.3)

At this point, the consumption data elements have been accessed by the program
and can be further analyzed and manipulated without the need of the input file. Two
different approaches may be used to handle this phase:

•  Walkthrough Analysis. Using this approach, the end user is heavily
involved in the decision-making process on a line-by-line,
paragraph-by-paragraph, or page-by-page basis. Having been read into the
program-provided data structure, a line, paragraph, or page of potential
consumption data is presented to the user. The user must then make a
decision of “yes or no” that the item(s) presented is/are valid data
element(s). If the user responds “yes,” the item(s) are transcribed into a
secondary data structure, one which consists of all the valid data
elements. If the user responds “no,” the element(s) are disregarded and the
user is presented a new data set. Alternatively, the user could be presented
an opportunity to access a given data element and manually enter the
correct information, thereby ensuring the consumption data is consistent
with the original source. The goal of this approach is to process the entire
document without the need to go back through it repeatedly. Although this
approach is time consuming, it aims to ensure all the data elements in the
document are analyzed. Should the logic to automatically analyze the
document not exist, this would be a feasible alternative.

•  Automated Analysis. Under this approach, the end-user is involved in the
decision-making process at the end after the program has made a “best
effort” to autonomously extract all consumption data elements. This
method is preferred over the walkthrough analysis only if the parsing logic
is extensive and accurate. The end goal of using this method is to save
time by allowing the program to quickly analyze the document and present
a summary at the end. Once the summary has been populated, the end-user
must then review the output for correctness. Should the parsing logic be
subpar, the end-user may find it takes longer to correct the summary than
it would to conduct the walkthrough analysis.
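
To give a sense of what “parsing logic” might involve, the sketch below splits one line of consumption data into its parts with a regular expression. It is an illustration only and is not the parsing logic of the programs in this thesis. The pattern assumes a weapon ID of one letter followed by four digits and a DODIC of one letter followed by three digits, which is inferred solely from the examples shown in this chapter; real identifiers may not all follow these shapes.

```python
import re

# Assumed line shape: <weapon ID> <weapon name> <DODIC> <ammunition name>
ELEMENT_PATTERN = re.compile(r"^([A-Z]\d{4})\s+(.+?)\s+([A-Z]\d{3})\s+(.+)$")


def parse_element(line):
    """Return the parts of a consumption data line, or None when it does not match."""
    match = ELEMENT_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "weapon_id": match.group(1),
        "weapon": match.group(2),
        "dodic": match.group(3),
        "ammunition": match.group(4),
    }
```

Lines that do not fit the assumed shape return None and would have to be set aside for the end-user to review, which is where an automated approach falls back on manual effort.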




Determining the best method to use, walkthrough or automated, depends on a
wide variety of factors, both personnel and software-related. Answering these questions
may help to address this conundrum:

•  Is  the  OCR  output  file  an  accurate  depiction  of  the  original  consumption 
document? 

•  Is  the  end  user  adequately  trained  to  verify  the  analyzed  data  elements? 

•  Is  the  program  mature  enough  to  handle  automated  analysis? 

•  Is  the  parsing  logic  robust  enough  to  recognize  and  extract  all  of  the 
potential  data  elements? 

Although  both  approaches  are  discussed,  the  automated  program  was  specifically 
designed  to  handle  consumption  documents  reviewed  by  this  thesis.  In  order  to  create  a 
more  robust  automated  application,  further  document  analysis  and  logic  test  creation 
must  occur.  However,  the  walkthrough  analysis  program  could  be  used  to  analyze  any 
input  file  since  it  treats  a  line  of  data  unambiguously.  Figure  21  shows  the  input  values  to 
the  walkthrough  analysis  program.  Figure  22  shows  the  program  and  its  output. 


This is not force consumption data.
B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1LB TNT
These are random numbers: 1234567890.
B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC

Figure 21. Walkthrough Analysis Inputs: Consumptiondata.txt

# This is a program to conduct walkthrough analysis.

initialConsumptionData = []     # Initial contents of the input file
finalConsumptionData = []       # Validated outputs
listSize = 0                    # Counters for program termination
finalListSize = 0
i = 0

# First, read in all the inputs into the initial list data structure.
for line in file:
    initialConsumptionData.append(line)
    listSize = listSize + 1
file.close()                    # Close the input file, no longer needed

# Afterwards, allow the end user to validate data elements.
userResponse = ""               # A variable to store the user's response
while (True):
    print("Is this a force consumption data element?")
    print(initialConsumptionData[i])
    userResponse = input()
    if (userResponse == "yes"):     # Only keep valid inputs
        finalConsumptionData.append(initialConsumptionData[i])
        finalListSize = finalListSize + 1
    i = i + 1
    if (i == listSize):             # Terminate the program
        break

i = 0
while (i < finalListSize):
    print(finalConsumptionData[i])  # Print out the valid outputs
    i = i + 1

--- OUTPUT ---

B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1LB TNT
B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC


Figure 22. Program to Execute Walkthrough Analysis

First, the program reads the post-OCR inputs into a list data structure named
“initialConsumptionData.” Once this is complete, the input file is closed. Afterwards, a
loop control structure is used to cycle through all the data elements in the list. As each
data element is presented, the end-user must make a “yes” or “no” decision that the
element is a consumption data element. If the user enters “yes,” this data element is
copied into the final data structure named “finalConsumptionData” and the process
continues until no more elements exist. Elements that do not receive a “yes” decision are
skipped and the process continues until no more elements remain to be considered. At the
end, the final list of data elements, having been verified by the end user, is printed out.

In addition to the ability to individually validate each element, logic can be
included to allow the user to correct each element as necessary, as noted above for the
walkthrough analysis. For example, the question “is this element accurate?” can be
posed to the user, allowing the user to inspect each data element for accuracy. Should
the element be inaccurate, the program would then allow the user to modify the value of
the element prior to placing it in the output data structure. Although this presents a
line-by-line review of the data elements, many different variations can be created. For
example, instead of showing one element at a time to the end-user, a page of data can be
presented with each line having an associated index number. The user could then indicate
which lines are consumption data elements and they would be copied over in the same
manner.
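
The correction step described above can be sketched as follows. This is a minimal illustration rather than the thesis program: the function name walkthrough, its ask parameter, and the prompt wording are assumptions introduced here, and scripted answers stand in for a user at the keyboard so that the example runs without interaction.

```python
def walkthrough(elements, ask=input):
    """Review each element; keep it, correct it, or skip it based on the answers.

    `ask` is any callable that takes a prompt and returns the user's reply.
    It defaults to the built-in input() for interactive use.
    """
    final_consumption_data = []
    for element in elements:
        element = element.strip()
        if not element:
            continue  # Ignore blank lines left over from OCR
        answer = ask('Is "%s" a consumption data element? (yes/no) ' % element)
        if answer.strip().lower() != "yes":
            continue  # Not consumption data, skip it
        answer = ask("Is this element accurate? (yes/no) ")
        if answer.strip().lower() != "yes":
            # Let the user replace the value before it is stored
            element = ask("Enter the corrected value: ").strip()
        final_consumption_data.append(element)
    return final_consumption_data


# Scripted answers (hypothetical) stand in for a user at the keyboard.
answers = iter([
    "yes",                                                      # element 1 is consumption data
    "no",                                                       # ...but it is not accurate
    "B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC",   # corrected value
    "no",                                                       # element 2 is not consumption data
])
sample = [
    "60471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC",
    "DISTRIBUTION STATEMENT A: Approved for public release",
]
reviewed = walkthrough(sample, ask=lambda prompt: next(answers))
print(reviewed)
```

Passing the prompt function as a parameter keeps the review logic separate from the keyboard, so the same loop could later be driven by a page-at-a-time interface such as the indexed variation described above.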

The  second  approach,  an  automated  analysis  program,  utilizes  the  same 
consumption  data  example  inputs  shown  in  Figure  21.  While  the  goal  is  to  achieve 
automation,  the  program  still  requires  the  end-user  to  review  the  summary  results  created 
by the program prior to saving. Figure 23 illustrates how pre-defined logic statements can
be used to facilitate automation for most of the program.
# This is a program to conduct automated analysis.

initialConsumptionData = []  # Initial contents of the input file
finalConsumptionData = []    # Validated outputs
listSize = 0                 # Counters for program termination
finalListSize = 0
i = 0

# First, read in all the inputs into the initial list data structure.
# ("file" is the input file object, assumed to be opened for reading beforehand.)
for line in file:
    initialConsumptionData.append(line)
    listSize = listSize + 1

file.close()  # Close the input file, no longer needed

# Afterwards, allow the program to make a decision based on logic statements.
searchString = "SQUAD DEMOLITION SET"  # Pre-defined force consumption element

for element in initialConsumptionData:  # Inspect each element in the list
    if searchString in element:
        finalConsumptionData.append(element)  # Found a match, add to output
        finalListSize = finalListSize + 1

while (i < finalListSize):
    print(finalConsumptionData[i])  # Print the outputs
    i = i + 1

B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1LB TNT
B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC

Figure  23.  Program  to  Execute  Automated  Analysis 


Similar to the walkthrough analysis program, this program begins by reading and
storing inputs into a list data structure. One approach for designing logic statements is to
create search strings. In this example, the string “SQUAD DEMOLITION SET” is used
to represent a consumption data element from the Class V(W) planning factors table.
After this has been defined, the program scans through each element of input and does a
logical comparison between the input and the desired search string. If the program is
presented with a line which contains the pre-defined search string, it places that particular
line into the output listing. This test can create false positives and must be refined
further. This simple example merely illustrates how the program can conduct its own
logical comparison of inputs without the need for constant human interaction. It also
helps to illustrate how one could approach logic creation. For example, one could create a
consumption data dictionary containing an extensive list of terms that can be
searched and compared against. Once this dictionary is populated, the program could then
test to see if any of those words exist in a particular line of consumption information.
Words such as “table” and “factors” would be good candidates for the dictionary along
with specific consumption data table names.
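
The dictionary idea can be sketched in a few lines. The term list, the function name find_consumption_lines, and the sample lines below are illustrative assumptions, not part of the thesis programs; a real dictionary would be far larger.

```python
# Hypothetical consumption data dictionary; a real one would hold many more terms.
CONSUMPTION_TERMS = ["SQUAD DEMOLITION SET", "TABLE", "FACTORS"]


def find_consumption_lines(lines, terms=CONSUMPTION_TERMS):
    """Return the lines that contain at least one dictionary term."""
    matches = []
    for line in lines:
        upper_line = line.upper()  # Compare without regard to case
        if any(term in upper_line for term in terms):
            matches.append(line.strip())
    return matches


sample = [
    "B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1LB TNT",
    "(2) Infantry-Heavy Threat Combat Planning Factors Table",
    "From: Commandant of the Marine Corps",
]
print(find_consumption_lines(sample))
```

Because this is a plain substring test, a word such as "acceptable" would also match the term "TABLE". That is one concrete form of the false positives mentioned above and a reason to keep the final user review.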

It is important to note that neither the walkthrough nor the automated analysis
program represents a panacea. Because such a program is in its infancy, it may be
necessary to begin with a more hands-on application such as the walkthrough analysis
program, which evolves over time into a more autonomous solution. As the pre-defined
logic base of the automated program becomes more mature, the program will
produce more reliable results. Refinement and standardization of the input documents
will also help to increase reliability.

d.  Export  Analyzed  Data  (3.1  and  3.2) 

Once the end-user has reached this part of the program, the user should have the
refined consumption outputs. These outputs should be carefully reviewed and corrected
to be consistent with the original source document. Although we use the term “end-user”
to refer to the user of the application, we fully expect that this data has also gone
through multiple levels of verification (i.e., “up the chain of command”). Once this
information has been vetted, it can then be directed as input to a database or saved to a
file.

(1)  Export  to  Database 

Under this approach, the program can leverage the use of pre-defined software
packages that interact with database systems. For example, Python has a module
named “_mysql” that allows it to interface with Oracle’s relational database software,
MySQL.
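
As a rough sketch of the database path, the example below stores validated elements in a table and reads them back. It deliberately swaps in Python's standard-library sqlite3 module for MySQL so that it runs without a database server; the table name consumption_data, the column name element, and the function name export_to_database are hypothetical.

```python
import sqlite3


def export_to_database(elements, connection):
    """Insert each validated element as one row (sqlite3 used as a stand-in)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS consumption_data ("
        "id INTEGER PRIMARY KEY, element TEXT NOT NULL)"
    )
    connection.executemany(
        "INSERT INTO consumption_data (element) VALUES (?)",
        [(element,) for element in elements],
    )
    connection.commit()


final_consumption_data = [
    "B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1LB TNT",
    "B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC",
]

connection = sqlite3.connect(":memory:")  # In-memory database for illustration
export_to_database(final_consumption_data, connection)
rows = connection.execute(
    "SELECT element FROM consumption_data ORDER BY id"
).fetchall()
connection.close()
print(rows)
```

The parameterized INSERT (the "?" placeholder) keeps OCR output containing quotation marks or other special characters from breaking the SQL statement.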

(2)  Export  to  File 

Instead  of  sending  the  outputs  to  a  database,  they  can  be  redirected  to  a  file.  In 
order  to  do  this,  the  program  simply  opens  a  file  in  the  write  (“w”)  mode,  as  previously 
mentioned.  By  saving  to  a  file,  many  different  options  exist  based  on  the  output  value 
format  (e.g.,  .docx,  .txt,  .xml). 
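
A minimal sketch of the file path, assuming plain-text output with one element per line. The function name export_to_file and the output file name are illustrative choices, not taken from the thesis programs.

```python
import os
import tempfile


def export_to_file(elements, path):
    """Write each validated element to its own line of a text file."""
    with open(path, "w") as output_file:  # "w" opens the file in write mode
        for element in elements:
            output_file.write(element + "\n")


final_consumption_data = [
    "B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1LB TNT",
    "B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC",
]

# Hypothetical output location in the system's temporary directory
output_path = os.path.join(tempfile.gettempdir(), "finalConsumptionData.txt")
export_to_file(final_consumption_data, output_path)

with open(output_path, "r") as check_file:
    written = check_file.read()
os.remove(output_path)  # Clean up the illustration file
print(written)
```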

Having  discussed  methods  for  processing  a  file  once  created,  the  next  chapter 
discusses  options  for  OCR  applications  by  which  the  files  themselves  may  be  generated 
from  existing  hardcopy  or  non-character-based  files. 
IV. FIELD DEMONSTRATION, TESTING, AND RESULTS


A.  OCR  TESTING  AND  COMPARISON 
1.  Testing  Environment 

Pages  were  selected  from  various  consumption  documents  that  illustrated  the 
different  types  of  data  and  layouts  that  are  commonly  encountered  by  the  planner: 

•  Figure  24  illustrates  the  front  page  of  a  Marine  Corps  order.  This  page 
contains  information  that  provides  background  information  and  usage 
procedures  for  the  document.  However,  it  also  contains  information  that 
can  be  used  to  uniquely  identify  the  document  and  can  be  extracted  for  use 
as  the  key  in  a  (key,  value)  pair.  This  example  will  be  referred  to  as  EX-1. 

• Figure 25 illustrates a page that contains a usage table. This table is
presented in profile view but must be turned 90 degrees clockwise to a
landscape view to read it. This example will be referred to as EX-2.

•  Figure  26  illustrates  a  page  that  contains  a  usage  table  in  portrait  view. 
This  table  appears  as  an  image  in  the  input  document.  This  example  will 
be  referred  to  as  EX-3. 

• Figure 27 illustrates a page that contains a usage table in portrait view.
This table was previously created using table formatting from another
text editor. This example will be referred to as EX-4.

Each application was graded on a scale of 1-3: one for poor, two for fair, and
three for good. A summary of these scores is presented at the end of this
section. The following factors were compared and graded for each application:

•  Accuracy  Rate.  The  accuracy  rate  is  depicted  as  the  number  of  errors  per 
page.  When  a  data  element  is  not  converted  or  converted  incorrectly,  we 
counted  it  as  a  single  error.  A  data  element  is  defined  as  one  single  entry 
or  word.  For  example,  “B0471  SQUAD  DEMOLITION  SET”  illustrates 
four data elements: B0471, SQUAD, DEMOLITION, and SET. Grading
of this criterion was based on three thresholds:

•  0-50  errors  per  page.  Awarded  a  three. 

•  50-100  errors  per  page.  Awarded  a  two. 

•  >100  errors  per  page.  Awarded  a  one. 

• Consistency. Consistency was tested to see if the applications return
regular results and whether or not they encounter the same errors when
given the same input document. To test consistency, each page was
scanned with each OCR application 10 times. Grading of this criterion was
based on the following:

•  8-10  OCR  attempts  produced  the  same/similar  result.  Awarded  a 
three. 

•  4-7  OCR  attempts  produced  the  same/similar  result.  Awarded  a 
two. 

•  <4  OCR  attempts  produced  the  same/similar  result.  Awarded  a 
one. 

• Speed. Speed was measured as the number of pages that could be
converted per hour. Speed also takes into account the time necessary for a
user to make corrections. Thus, speed is also directly related to the
accuracy rate. Grading of this criterion was based on the following:

•  45-60  pages  processed  per  hour.  Awarded  a  three. 

• 30-45 pages processed per hour. Awarded a two.

•  <30  pages  processed  per  hour.  Awarded  a  one. 

•  Ease  of  Use.  This  relates  to  the  user’s  ability  to  efficiently  use  the 
application.  This  factor  is  based  on  our  use  of  the  program  and  may  not 
accurately  represent  a  novice  user.  Time  to  master  functionality  of  the 
program  guided  the  following  grading  criteria: 

•  <5  minutes  required  to  master  the  program.  Awarded  a  three. 

•  <15  minutes  required  to  master  the  program.  Awarded  a  two. 

•  >15  minutes  required  to  master  the  program.  Awarded  a  one. 

• Functionality. This relates to the application’s ability to provide useful
and helpful tools to the user (e.g., spellcheckers, input formats, and output
formats). Grading of this criterion was based on the following:

•  The  application  had  numerous  input/output  file  types  and  included 
functionality  that  substantially  increased  end-user  productivity. 
Awarded  a  three. 

•  The  application  had  several  input/output  file  types  and  included 
functionality  that  increased  end-user  productivity.  Awarded  a  two. 

•  The  application  had  limited  input/output  file  types  and  included 
functionality  that  had  a  limited  impact  on  end-user  productivity. 
Awarded  a  one. 
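
The three numeric criteria above can be expressed as simple threshold functions. This is a sketch under stated assumptions: the function names are hypothetical, and because the published ranges share their boundary values (for example, 50 errors falls in both the first and second accuracy ranges), a boundary value is assumed here to receive the higher grade. Speeds above 60 pages per hour are also assumed to earn a three.

```python
def grade_accuracy(errors_per_page):
    """3 for 0-50 errors, 2 for up to 100, 1 above 100 (boundaries graded up)."""
    if errors_per_page <= 50:
        return 3
    if errors_per_page <= 100:
        return 2
    return 1


def grade_consistency(matching_attempts):
    """Number of the 10 OCR attempts that produced the same/similar result."""
    if matching_attempts >= 8:
        return 3
    if matching_attempts >= 4:
        return 2
    return 1


def grade_speed(pages_per_hour):
    """3 for 45 or more pages per hour, 2 for 30-45, 1 below 30."""
    if pages_per_hour >= 45:
        return 3
    if pages_per_hour >= 30:
        return 2
    return 1


print(grade_accuracy(40), grade_accuracy(75), grade_accuracy(120))
print(grade_consistency(9), grade_consistency(5), grade_consistency(2))
print(grade_speed(50), grade_speed(30), grade_speed(15))
```

Ease of use and functionality are judged qualitatively in this section, so they are left out of the sketch.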
Figure 24. EX-1, Marine Corps Order Front Page (from [2])

Figure 25. EX-2, Usage Table in Profile View (from [2])

Figure 26. EX-3, Usage Table as an Image (from [2])

Figure 27. EX-4, Usage Table as a Table (from [2])
2. OCR Applications

The  following  off-the-shelf  programs  were  compared: 

• http://www.onlineocr.net/. Free, open-source, and online OCR
application. The primary focus of this application is to conduct OCR.
Available in guest and member modes:

•  Guest  Mode.  Accepts  PDF  and  image  (JPG,  BMP,  TIFF,  and  GIF) 
with  a  maximum  file  size  of  5  MB.  Output  file  types  are  MS 
Word®  .docx,  MS  Excel®  .xlsx  and  plain  text  .txt.  Conversion  is 
limited  to  15  documents  (single  page)  per  hour. 

•  Member  Mode.  Accepts  the  same  inputs  as  guest  mode.  Output 
file types are PDF, MS Excel® .xls and .xlsx, MS Word® .doc and
.docx, plain text .txt, and Rich Text Format .rtf. Maximum input file
size  is  increased  from  5  MB  to  100  MB.  New  members  have  a  25- 
page  credit.  Once  this  limit  has  been  reached,  additional  pages 
must  be  purchased.  The  price-per-page  decreases  when  bulk 
amounts  are  purchased.  For  example,  purchasing  50  pages  has  a 
cost  of  10  cents  per  page  ($4.99)  whereas  purchasing  50,000  pages 
has  a  cost  of  0.4  cent  per  page  ($199.95). 

•  Microsoft  OneNote®.  Commercial  software  published  by  the  Microsoft 
Corporation.  Sold  stand-alone  or  as  a  part  of  the  Microsoft  Office  Suite®. 
The  primary  focus  of  this  application  is  to  provide  a  workspace  for  the 
user  to  collect  notes  and  organize  documents.  OCR  is  a  feature  within 
OneNote®.  Accepts  any  input  file  on  the  Windows  OS:  images,  PDF,  MS 
Office®  document  extensions,  text  files,  etc.  Output  file  extensions  are: 
.doc, .docx, .txt, and .pdf.

•  Nuance  OmniPage®.  Commercial  software  published  by  the  Nuance 
Corporation.  The  primary  focus  of  this  application  is  to  conduct  OCR. 
Accepts  digital  camera  images,  images  (JPG,  BMP,  TIFF,  GIF,  PNG),  and 
PDF.  There  are  over  50  different  output  file  extensions  that  fall  into  eight 
categories:  HTML,  MS  Excel®,  MS  Word®,  MS  PowerPoint®,  PDF, 
RTF,  Unicode  Text,  and  XML. 

Table  1  is  a  summary  of  the  input  file  types  accepted  by  the  applications. 

Likewise,  Table  2  illustrates  the  output  file  types  that  they  are  capable  of  producing. 




Input File Type                     onlineocr.net   onlineocr.net   Microsoft   Nuance
                                    (Guest)         (Member)        OneNote®    OmniPage®
Images (.jpg, .gif, .tiff, .png)    Yes             Yes             Yes         Yes
PDF (.pdf)                          Yes             Yes             Yes         Yes
Text (.txt)                         -               -               Yes         -
MS Word® (.doc, .docx)              -               -               Yes         -
MS Excel® (.xls, .xlsx)             -               -               Yes         -
Digital Camera (Direct input)       -               -               -           Yes
Totals                              2               2               5           3

Table 1. OCR Input File Types
Output File Type            onlineocr.net   onlineocr.net   Microsoft   Nuance
                            (Guest)         (Member)        OneNote®    OmniPage®
PDF (.pdf)                  -               Yes             Yes         Yes
Text (.txt)                 Yes             Yes             Yes         Yes
MS Word® (.doc, .docx)      Yes             Yes             Yes         Yes
MS Excel® (.xls, .xlsx)     Yes             Yes             Yes         Yes
MS PowerPoint® (.ppt)       -               -               -           Yes
Rich Text Format (.rtf)     -               Yes             -           Yes
HTML                        -               -               -           Yes
XML                         -               -               -           Yes
Unicode Text                -               -               -           Yes
Totals                      3               5               4           9

Table 2. OCR Output File Types
Since the primary focus of OCR is detecting and transcribing text from images, it
is unremarkable that each of the applications accepts image files as input. However, the
fact that each of them accepts a PDF file is important since this is the format in which the
consumption documents will most likely be available. Also, it is important to note that
OneNote® is capable of accepting many additional input file types because it conducts a
file-to-image conversion of all input documents. For example, when a Word® document
is placed into OneNote®, it represents the document as an image in the note. OCR must
then be conducted on the image to extract the text.

Although all of the applications are capable of creating Word® and Excel®
outputs, the text file extension, .txt, provides the most flexibility for developing the
conversion programs. Creating a program to accept Word® and Excel® files adds no
extra functionality and often requires unnecessary libraries. Therefore, this thesis will
create a text file with a .txt extension as the output format of the OCR applications for
later use in the data extraction programs.

3.  Online,  Open-Source  OCR 

The  first  application  we  tested  was  the  open-source  application  available  at 
http://www.onlineocr.net/.  In  order  to  maximize  the  number  of  documents  that  could  be 
tested,  the  guest  mode  was  utilized.  The  member  mode  offered  no  further  extension  of 
capability  that  would  have  been  beneficial  to  the  discussion.  Figure  28  illustrates  the 
main  page  of  the  application  as  of  8  August  2014. 
[Screenshot of the onlineocr.net main page, annotated with the following features:
allows one file to be entered with a maximum size of 5 MB; offers support for 46
languages; output file types: .docx, .txt, .xlsx.]

Figure 28. http://www.onlineocr.net: Main Page and Features (after [9])


Although the application claims that it can convert a file up to 5 MB, it is
important to note that it will only convert one page. If the input is a multi-page PDF, it
will only convert the first page. Therefore, each page was submitted individually until all
the pages had been converted. The application converted EX-1 with zero errors. This
result is unremarkable because the page was previously prepared using text editing
software and presents a very clear input document. Figure 29 illustrates the post-OCR
results of EX-1.
DEPARTMENT OF THE NAVY
HEADQUARTERS UNITED STATES MARINE CORPS
WASHINGTON, DC 20380-0001

MARINE CORPS ORDER 8010.1E

MCO 8010.1E
C 392
15 Apr 97

From: Commandant of the Marine Corps
To: Distribution List

Subj: CLASS V(W) PLANNING FACTORS FOR FLEET MARINE FORCE COMBAT
OPERATIONS

Ref: (a) Marine Corps Ground Ammunition War Materiel
Requirement (WMR) Determination (1995-1996) Study
Final Report (NOTAL)
(b) MCO P4400.39G (NOTAL)
(c) FMFM 4-1
(d) FM 9-6
(e) FM 9-13

Encl: (1) Explanation of the Scenario-Base Combat Planning
Factors Tables
(2) Infantry-Heavy Threat Combat Planning Factors Table
(3) Armor-Heavy Threat Combat Planning Factors Table
(4) Composite Combat Planning Factors Table
(5) Combat Planning Factors for Special Operations
(6) Artillery Ancillary Items

1. Purpose. To promulgate Class V(W) combat planning factors (CPF's) to
support Fleet Marine Force (FMF) combat operations.

2. Cancellation. MCO 8010.1D.

3. Background. Reference (a) reports the results of the Marine Corps Class
V(W) WMR Study (1995-1996). Reference (b) establishes Marine Corps policy
governing requirements determination, acquisition, management, and
distribution
of war reserve materiel. References (c), (d) and (e) provide logistical
doctrine and associated tactics, techniques, and procedures for
Class V(W) support during combat operations.

4. Planning Factors. Factors to be used during initial planning for combat
operations are explained in enclosure (1) and shown in enclosures (2) through
(5).

a. CPF's reflect the anticipated expenditure of ground ammunition over
designated time periods of combat operations. These rates represent the
unconstrained requirement. Unconstrained requirements are based on approved
force structure, weapon mix, anticipated duration of combat, and the
anticipated intensity of conflict. Once Version 2.1 of the Ammunition
Prepositioning and Planning System (APPS) is fielded, the APPS

DISTRIBUTION STATEMENT A: Approved for public release; distribution is
unlimited.

Figure  29.  Post-OCR  Results  of  EX-1  using  onlineocr.net  (after  [2]) 

EX-2  was  tested  next.  Figure  30  illustrates  the  post-OCR  results  with  errors 
highlighted  in  red. 
Figure 30. Post-OCR Results of EX-2 using onlineocr.net (after [2])


When  OCR  was  conducted  on  EX-2,  28  errors  were  encountered: 

•  Five  errors  occurred  in  the  conversion  of  text. 

•  23  errors  occurred  in  the  conversion  of  numbers. 

•  Of  the  23  errors  that  involved  numbers,  16  of  those  errors  occurred  in  the 
first  column,  “Weapon  ID.” 

The high error rate in the “Weapon ID” column can be attributed to the
application’s inability to distinguish the letter “B” from the numbers “6” and “8.”
This was most likely caused by the fact that the letter “B” has rounded corners and
appears fuzzy in the input document. This is a normal and expected degradation of a
document whose original publication date was 15 April 1997. Also, the enclosures may
have been adopted from a document that was created before the source document came
into existence. An important finding is that the online OCR application was intelligent
enough to determine that the data, although presented in a vertical fashion, was best
suited for representation as a table in a horizontal view. Thus, the application took the
page that was originally presented in a profile view and presented its output as a file in
landscape view.
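
The confusion between “B” and the numbers “6” and “8” is regular enough that a data extraction program could flag or repair it before the user review. The sketch below is only a heuristic under loud assumptions: it supposes that an identifier at the start of a line should read as the letter “B” followed by four digits (as in B0471), and the pattern and function name are hypothetical. A different ID scheme would need a different rule.

```python
import re

# Assumption: an ID at the start of a line should be "B" plus four digits,
# so a leading "6" or "8" followed by four digits is treated as a misread "B".
MISREAD_ID = re.compile(r"^[68](\d{4})\b")


def repair_weapon_id(line):
    """Replace a leading '6' or '8' with 'B' when the line starts with five digits."""
    return MISREAD_ID.sub(r"B\1", line)


print(repair_weapon_id("80471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1LB TNT"))
print(repair_weapon_id("60471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC"))
print(repair_weapon_id("B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC"))
```

A rule of this kind would also rewrite a genuine five-digit number at the start of a line, so its output should still pass through the user review rather than being trusted outright.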

EX-3  was  tested  next.  Figure  31  illustrates  the  post-OCR  results  with  errors 
highlighted  in  red. 


Figure 31. Post-OCR Results of EX-3 using onlineocr.net (after [2])


When  OCR  was  conducted  on  EX-3,  37  errors  were  encountered: 

• 33 errors occurred because the application omitted three entire lines of
data.

• Two errors occurred in the conversion of text.

• Two errors occurred in the conversion of numbers.

An important finding is that the OCR application did not place the data elements
into a table. The application interpreted the page contents as an image, rather than a
table. However, an appropriate amount of space was placed between the data elements for
readability. The cause for the omission of three lines of data is unknown.

EX-4  was  tested  next.  Figure  32  illustrates  post-OCR  results  with  errors  in  red. 
Figure 32. Post-OCR Results of EX-4 using onlineocr.net (after [2])

When OCR was conducted on EX-4, 12 errors were encountered:

•  Five  errors  occurred  in  the  conversion  of  text. 

•  Seven  errors  occurred  in  the  conversion  of  special  characters. 

The  application  was  unable  to  determine  the  difference  between  the  letter  “D”  and 
the  number  “0”  due  to  degradation  of  the  source  document.  The  application  was  unable  to 
determine  the  difference  between  the  forward  slash  character  “/”  and  the  letter  “1.”  An 
important  finding  is  the  fact  that  the  application  recreated  the  table  structure  from  EX-4 
near-perfectly. 

Table  3  illustrates  a  summary  of  the  accuracy  rate  and  important  findings  for  the 
open-source  application. 


Page     Total    Word     Number   Cause / Findings
         Errors   Errors   Errors

EX-1     0        0        0        • 100% accuracy rate of conversion may be
                                      attributed to the previous use of text
                                      editing software.

EX-2     28       5        23       • Problems distinguishing the letter “B”
                                      from the numbers “6” and “8.”
                                    • Text converted from portrait view to
                                      landscape.
                                    • Data placed in a table data structure.

EX-3     37       5        32       • Text interpreted as an image rather than a
                                      table.

EX-4     12       8        4        • Near-perfect table recreation.
                                    • Problems distinguishing the letter “D”
                                      from the number “0.”
                                    • Problems distinguishing the special
                                      character, forward slash “/,” from the
                                      letter “1.”

Totals   77       18       59

Table  3.  Online,  Open-Source  OCR  Results  and  Findings 
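
The totals row of Table 3 can be checked with a short tally. The dictionary layout and the function name column_totals are illustrative choices; the counts are the per-page values reported in the table.

```python
# Per-page error counts as reported in Table 3
results = {
    "EX-1": {"total": 0, "word": 0, "number": 0},
    "EX-2": {"total": 28, "word": 5, "number": 23},
    "EX-3": {"total": 37, "word": 5, "number": 32},
    "EX-4": {"total": 12, "word": 8, "number": 4},
}


def column_totals(page_results):
    """Sum each error column across all pages."""
    totals = {"total": 0, "word": 0, "number": 0}
    for counts in page_results.values():
        for column, count in counts.items():
            totals[column] += count
    return totals


# Each page's word and number errors should add up to its total.
for page, counts in results.items():
    assert counts["word"] + counts["number"] == counts["total"], page

print(column_totals(results))
```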
Overall, the online, open-source OCR application demonstrated the ability to
convert the original consumption documents into useful output files. It also demonstrated
the ability to recreate tables and recognize when data is best presented in other formats.
For example, converting the data contained in EX-2 from a vertical profile view to a
horizontal landscape view is helpful. Based on the findings for this application, the
following scores were given:

•  Accuracy:  2 

•  Consistency:  3 

•  Speed:  1 

•  Ease  of  Use:  3 

•  Functionality:  1 

The accuracy rate of the application was manageable and consistent. The
application suffers in speed: guest mode limits the user to 15 pages per hour. Member
mode would mitigate this limit, albeit at additional cost. Using and understanding the
functionality of the program can be accomplished in five minutes. The application
provides the fewest input and output formats and has no spellchecking ability. Once OCR
has been completed, the user must open the output file in a text editor to review and
correct its contents.

4.  Microsoft  OneNote®  OCR 

Microsoft OneNote® was tested next. Figure 33 illustrates the post-OCR results
of EX-1.


DE  PARTMErJT  O  F  THE  INI AVY 
HEADQUARTERS  UNITED  STATES  MARINE  CORPS 
WAS  H  IN  GTON ,  DC  20380-0001 

Figure  33.  Post-OCR  Results  of  EX-1  using  OneNote®  (after  [2]) 

Although OneNote® was given the same page that the online, open-source
application was given, it was unable to OCR the document past the first three lines. This
is remarkable because it represents the first major failure of an OCR application to
successfully convert an input document. To explore the significance of different file
formats, the input document was converted from PDF to JPG. Figure 34 illustrates the
second round of OCR testing conducted on EX-1 in an image format.
IS )  Co  n  b  St  PI  a  n  n  i  ng  F  a  ct:ors  { o  r  S  p  e  ci  a  1 0  p  e  rat  i  o  n  s
161  ArtilleryAnclllaryttens 

Figure  34.  Post-OCR  Results  of  EX-1  using  OneNote®:  Second  Pass  (after  [2]) 


Without listing the entire contents of the output, we can quickly see that the
output is highly inaccurate in both spelling and format. Thus, in order to make an
accurate output file, the post-OCR results would need to be heavily corrected. OneNote®
has a built-in spellchecker. This can be leveraged to correct the errors and alleviate some
of the burden on the user; however, the user must select this option since it does not turn
on automatically after OCR is complete.

EX-2  was  tested  next.  Figure  35  illustrates  the  post-OCR  results. 
Figure 35. Post-OCR Results of EX-2 using OneNote® (after [2])

When  OCR  was  conducted  on  EX-2,  OneNote®  was  unable  to  accurately  process 
the  input.  Although  the  document  was  converted  to  an  image  and  placed  back  in 
OneNote®  in  a  landscape  view  as  a  secondary  test,  the  same  result  was  encountered. 
Thus,  heavy  modification  or  reformatting  of  the  input  document  would  be  necessary  to 
properly  process  the  document. 

EX-3  was  tested  next.  Figure  36  illustrates  the  post-OCR  results. 
Figure 36. Post-OCR Results of EX-3 using OneNote® (after [2])


When OCR was conducted on EX-3, 47 errors were encountered:

• 32 errors occurred in the conversion of numbers.

• 14 errors occurred in the conversion of text.

• One error occurred in the conversion of the special characters.

The majority of the errors occurred in the first column where the DODIC is
represented by alphanumeric characters. The second region that encountered the most
problems also involved alphanumeric characters. An important finding is that OneNote®
provided very little in the way of formatting. The original input had large areas of white
space to provide readability whereas OneNote® left-aligned the majority of the document
and removed this white space, making the output difficult to read for an end-user.

EX-4  was  tested  next.  Figure  37  illustrates  the  post-OCR  results. 
Figure 37. Post-OCR Results of EX-4 using OneNote® (after [2])


When OCR was conducted on EX-4, two errors were encountered that involved
the conversion of text. Although the input had been previously prepared using a text
editor and was presented in a highly structured table format, OneNote® failed to recreate
the table and present the data in a useful format. Table 4 illustrates a summary of the
accuracy rate and important findings for OneNote®.

Page     Total    Word     Number   Cause / Findings
         Errors   Errors   Errors
EX-1     > 100    > 100    > 100    • Unable to convert the input document past
                                      the first three lines of data.
                                    • Native spellchecking capability
                                      discovered.
EX-2     > 100    > 100    > 100    • Complete failure to detect input layout.
EX-3     47       14       32       • Words comprised of alphanumeric
                                      characters represented 46 out of 47 errors.
                                    • Problems distinguishing the special
                                      character forward slash “/” from the
                                      special character exclamation point “!”.
EX-4     2        2        0        • Failure to recreate table for readability.
Totals   > 100    > 100    > 100

Table 4. OneNote® Results and Findings


In  summary,  OneNote®  was  incapable  of  accurately  conducting  OCR  on  EX-1, 
EX-2,  and  EX-4.  OneNote®  comes  with  the  functionality  of  spellcheck  but  does  not 
provide  the  functionality  to  apply  formatting  to  the  OCR  output.  While  the  program 
offers  unlimited  OCR  capability,  it  must  be  purchased  in  order  to  do  so.  The  main  benefit 
of  the  program  remains  focused  on  its  ability  to  quickly  and  efficiently  take  notes  and 
requires  its  secondary  OCR  feature  to  be  further  refined  before  wide-scale  use  as  a 
reliable  OCR  application.  Based  on  the  findings  for  this  application,  the  following  scores 
were  given: 

•  Accuracy:  1 

•  Consistency:  1 

•  Speed: 1 

•  Ease  of  Use:  2 

•  Functionality:  2 

The accuracy rate of the application was poor and inconsistent. The application is
capable of conducting OCR quickly; however, overall speed suffers because of the number
of errors that need to be corrected by the user. The built-in spellchecker functionality can
aid the user in correcting these errors and offsets some of the speed penalties. Using and
understanding the functionality of the program can be accomplished in 10-15 minutes.
The application provides the most input file types but has limited output options.

5.  Nuance  OmniPage®  OCR 

Nuance  OmniPage®  was  the  last  application  tested.  When  the  software  loads,  the 
user  is  presented  with  a  menu  to  choose  what  kind  of  conversion  they  would  like  to 
accomplish.  Figure  38  illustrates  this  screen. 


Figure  38.  OmniPage  Start  Screen 

While there are templates, known as “workflows,” the method used by this thesis
was the “open file” option. Once this option is clicked, a dialog box is presented that
allows the user to select what documents they would like to OCR. An important finding
is that the application allows the user to select multiple input documents of various
formats at one time. For example, the user can select a PDF, an image, and another PDF
all at the same time. Likewise, the user can select a PDF and the application will import
all of its pages. After the files have been selected and imported into the program, the user
is presented with a workspace view. The workspace view has multiple frames and allows
the user to rearrange the frames as they please. This view is presented in Figure 39.


Figure  39.  OmniPage®  Workspace 


The  three  main  frames  of  the  application  are  a  thumbnail  screen  (left),  a  page 
image  screen  (middle),  and  a  text  editor  screen  (right).  Before  OCR  has  been  conducted, 
only  the  thumbnail  screen  and  page  image  screen  contain  information.  In  order  to  start 
processing  the  document,  the  user  must  click  a  button  aptly  named  “Start  Processing.” 
Once  the  button  is  clicked,  OCR  is  performed  on  the  input  documents,  the  text  editor 
screen is populated to show the output, and a proofreading screen appears to walk the
user through the document. Figure 40 represents the state of the application once the
“Start Processing” button has been clicked. For clarity, Figure 41 illustrates the
proofreading screen separately.


Figure 40. OmniPage® Workspace post-OCR

Figure 41. OmniPage® Proofreader

The proofreader screen allows the user to walk through the document to verify two
cases: spelling and suspected inaccuracy. If the program encounters a word that is not in
its dictionary and is therefore considered a misspelling, “MCO” for example, it allows the
user to change it or add it to the dictionary. Adding the word to the dictionary eliminates
the need to correct the word later in the document or in future documents. If the program
encounters a word that it suspects is an inaccurate transcription, it prompts the user to
enter the correct entry. Both of these cases are handled at the same time on a line-by-line
basis. To aid the user, the application shows the original entry in the top box and allows
the user to retype the information in the middle box. Once the user has reviewed and
corrected the OCR output, they must click the “save to files” button. This opens a dialog
box and prompts the user to enter a filename and desired output file type. Figures 42
through 45 illustrate the OCR output produced by OmniPage®.

[Figure 42 image: OmniPage® OCR output for EX-1; the page text is not reproduced here.]


Figure  42.  Post-OCR  Results  of  EX-1  using  OmniPage®  (after  [2]) 

[Figure 43 image: OmniPage® OCR output for EX-2; the page text is not reproduced here.]

Figure  43.  Post-OCR  Results  of  EX-2  using  OmniPage®  (after  [2]) 

[Figure 44 image: OmniPage® OCR output for EX-3; the page text is not reproduced here.]

Figure  44.  Post-OCR  Results  of  EX-3  using  OmniPage®  (after  [2]) 

[Figure 45 image: OmniPage® OCR output for EX-4; the page text is not reproduced here.]

Figure  45.  Post-OCR  Results  of  EX-4  using  OmniPage®  (after  [2]) 


Table 5 illustrates a summary of the accuracy rate and important findings for
OmniPage®.

Page     Total    Word     Number   Cause / Findings
         Errors   Errors   Errors
EX-1     0        0        0        • Recreated near-perfectly.
EX-2     14       7        7        • Text converted from portrait view to
                                      landscape.
                                    • Data placed in a table data structure.
EX-3     4        3        1        • Data placed in a table data structure.
EX-4     5        2        3        • Near-perfect table recreation.
Totals   23       12       11

Table 5. OmniPage® Results and Findings


In general, OmniPage® accurately transcribed each input file. In most cases, the
application created fully modifiable outputs that were near-duplicates of the input files.
While the program did encounter errors, no specific trends appeared. When an error was
encountered, it was corrected with the proofreader. The application was able to detect and
represent data in different views (landscape and portrait) and also created very accurate
and well-defined tables. Based on the findings for this application, the following scores
were given:

•  Accuracy:  3 

•  Consistency:  3 

•  Speed:  3 

•  Ease  of  Use:  1

•  Functionality:  3 

OmniPage® had the highest accuracy rate of all the applications. It consistently
produced the same results. The application's OCR speed and proofreader allow the user to
quickly review and correct the document. Using and understanding the basic functionality
of the program can be accomplished in approximately one hour. Understanding the
advanced functionality of the program can be accomplished in 2-3 hours. The application
has fewer input file types than OneNote®; however, it has the most output
types of all three applications.

6. OCR Summary

Table  6  illustrates  a  summary  of  the  scores  given  to  all  the  applications. 


                Accuracy  Consistency  Speed  Ease of Use  Functionality  Totals
Onlineocr.net   2         3            1      3            1              10
OneNote®        1         1            1      2            2              7
OmniPage®       3         3            3      1            3              13

Table 6. OCR Summary Scores


In general, the OmniPage® software out-performed the other applications and
received the highest score. Not only did it have the highest accuracy rate, it also provided
the most functionality: spellcheck, a native text editor, and the most output formats. Based
on these findings, OmniPage® was used to create the text files that were later used by the
extraction programs.
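
The totals in Table 6 are plain sums of the five category scores. As a minimal sketch (the data layout and the function name rank_applications are illustrative assumptions and are not part of the software written for this thesis), the ranking can be reproduced as follows:

```python
# Scores transcribed from Table 6, in the order:
# Accuracy, Consistency, Speed, Ease of Use, Functionality.
# The dictionary layout is an illustrative assumption.
SCORES = {
    "Onlineocr.net": [2, 3, 1, 3, 1],
    "OneNote": [1, 1, 1, 2, 2],
    "OmniPage": [3, 3, 3, 1, 3],
}


def rank_applications(scores):
    """Return (name, total) pairs sorted from highest to lowest total."""
    totals = {name: sum(values) for name, values in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, total in rank_applications(SCORES):
        print(f"{name}: {total}")
```

Because every category carries equal weight in this sum, a single low score (such as Ease of Use for OmniPage®) can be offset by high scores elsewhere.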

The second-best application was the open-source application. Although the
application proved to be accurate, intelligent, and quickly learnable, it is limited by its
page-per-hour restriction and number of input and output formats.

OneNote®  was  least-favored  because  it  produced  highly  inaccurate  and 
inconsistent  results.  While  the  program  provides  spellcheck  functionality  and  a  vast  array 
of  input  formats,  it  has  limited  output  options  and  requires  heavy  user-involvement  to 
correct  the  OCR  outputs. 

B.  PROGRAM  DEMONSTRATION 

After the OCR comparison was conducted, two programs were created to extract
the text-based consumption data from the text files produced by OmniPage®. The
goal of the first program was to automate the process of data extraction by using
pre-defined decision-making logic. While the program strives for automation during the
text extraction phase, user interaction is necessary at the end to verify the outputs. The
goal of the second program was to involve the user in every decision. Since it had no
pre-defined logic statements, the responsibility for deciding whether a consumption data
element was true and accurate was placed on the user. These programs were created with
one assumption: the input document was in an acceptable format for the application and
free of errors. Thus, the programs were given “perfect inputs,” which allows the testing to
focus solely on data extraction.

The  automated  program  was  tested  first  in  a  two-phase  process.  During  the  first 
phase,  each  page  was  examined  separately  to  determine  what  unique  characteristics 
existed  to  distinguish  desired  elements  from  superfluous  information.  Once  the  unique 
characteristics  (if  any)  had  been  identified,  the  program  was  written  to  detect  them  and 
successfully  extract  their  contents.  Some  of  the  input  documents  followed  regulated 
correspondence  procedures,  allowing  the  leveraging  of  some  of  their  suitable 
characteristics.  During  the  second  phase,  all  five  of  the  input  documents  were  placed  into 
one  document  for  the  application  to  process  as  a  whole.  This  tested  the  ability  of  the 
automated  application  to  work  as  designed.  After  the  automated  application  was  tested, 
the  walkthrough  application  was  designed  and  tested  as  two  versions:  line-by-line  and 
page-by-page. 

1.  Automated  Program 

a.  File  Input  and  Closing 

In  order  to  begin  extracting  consumption  data  out  of  the  input  files,  the  program 
first  opens  an  input  file,  reads  each  line  into  a  list,  and  then  closes  the  input  file.  While 
this  part  of  the  program  produces  no  output,  it  handles  necessary  application  overhead  to 
start  the  process.  Figure  46  illustrates  the  coding  of  the  file  open  and  close  function. 

# This module handles opening of the input file,
# reading in of inputs, and closing of the file.

rawData = []        # Holds the original lines
lineCount = 0
rawDataSize = 0

def fileOpenRoutine(fileName):
    global lineCount
    global rawDataSize
    lineCount = 0
    rawDataSize = 0
    file = open(fileName, 'r')              # Open the file
    for line in file:
        line = line.strip()                 # Strip white space
        if (line != ""):                    # Remove blank lines
            rawData.append(line)            # Add it to the list
            lineCount = lineCount + 1
            rawDataSize = rawDataSize + 1
    file.close()                            # Close the input file

#---------- Main ----------#

print("Input filename to analyze: ")
fileName = input()
fileOpenRoutine(fileName)


Figure  46.  Automated  Program:  File  Open  and  Close 


The program begins by asking the user to input a filename. Once the input has
been given, the program calls the first function: fileOpenRoutine(fileName). This function
takes the name of a file entered by the end user, opens it, reads in each line from the file,
records the number of lines, and closes the file. To aid later data extraction and
readability, white space is stripped off the beginning and end of each line and blank
lines are removed.
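
The same read, strip, and filter steps can also be written without global counters. The following is a minimal sketch and not the code used in this thesis; the function name read_nonblank_lines is a hypothetical choice:

```python
def read_nonblank_lines(file_name):
    """Return the stripped, non-blank lines of a text file as a list."""
    lines = []
    # The with-statement closes the file even if reading raises an error.
    with open(file_name, "r") as handle:
        for line in handle:
            line = line.strip()   # drop leading and trailing white space
            if line:              # skip lines that are empty after stripping
                lines.append(line)
    return lines
```

With this form, the number of stored lines is available as len() of the returned list, so no separate line counter has to be kept in step with the data.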

b. Detect and Extract Document Information

Now that the input file has been read, the program begins by identifying and
storing the important identifying information of each document: name, publication date,
subject, etc. Figure 47 illustrates the coding of this function.


identifierInfo = []     # Identifier information
counter = 0

def handleMarineCorpsOrderIdentifyingInformation(rawDataList):
    counter = 9
    subj = ""

    identifierInfo.append(rawData[6])   # Line 6 should be Marine Corps Order and #
    identifierInfo.append(rawData[5])   # Line 5 should be the date

    while "Ref" not in rawData[counter]:    # Subj starts on line 9 and ends when it finds Ref
        if (subj == ""):
            subj = rawData[counter]
        else:
            subj = subj + " " + rawData[counter]    # Process a multi-line subject
        counter = counter + 1

    identifierInfo.append(subj)

#---------- Main ----------#

if "MCO" in rawData[3]:     # We look for the indication of a Marine Corps Order
    handleMarineCorpsOrderIdentifyingInformation(rawData)


Figure  47.  Automated  Program:  Detect  and  Extract  Document  Info 


This part of the program begins with a logic test. The statement “if "MCO" in
rawData[3]” checks whether the string “MCO” is present in the third line of the
document. Due to the template-based nature of the document, the test is passed, and the
program executes the function to handle an MCO:
handleMarineCorpsOrderIdentifyingInformation(rawDataList). It is important
to note that the input document was written in 1997 and may not represent the format of a
current MCO. Thus, it is important for a final and fully-implemented application to
follow the most current standards and policies.

The function handleMarineCorpsOrderIdentifyingInformation(rawDataList)
handles the identification of the document name, the date it was published, and the
subject of the document. Since an MCO follows standard correspondence procedures and
the program was given perfect input, these data fields were extracted out of the document
by using their exact line numbers. For example, MARINE CORPS ORDER 8010.1E resides on
line six, the date of the document resides on line five, and the subject of the document
begins on line nine and ends when the first occurrence of “Ref:” is encountered. It is
common for the subject to span several lines, requiring the program to link
(concatenate) the lines together in order to accurately present the subject field. Testing for
“Ref:” also represents one of the first major problems with implementing an automated
program. The test for this exact string is the only condition that stops the loop that
collects the subject. Since the program was given perfect inputs, this
was not a problem. However, if the user who verifies the OCR output makes an error and
allows a different string such as “ref:” to go through, the loop would never find its
stopping point and this particular program would crash.
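
One way to make the subject extraction tolerant of such errors is to bound the loop by the length of the input and to ignore case when looking for the landmark. The following is a minimal sketch under those assumptions and is not the thesis code; the function name extract_subject, its parameters, and the choice to match only lines that start with the marker are hypothetical:

```python
def extract_subject(lines, start=9, marker="ref:"):
    """Join lines from index `start` up to the first line starting with `marker`.

    Returns (subject, marker_found). Matching ignores case, and the loop is
    bounded by the length of the list.
    """
    parts = []
    for line in lines[start:]:
        if line.lower().startswith(marker):
            return " ".join(parts), True
        parts.append(line)
    # Marker never seen: report it to the caller instead of raising IndexError.
    return " ".join(parts), False
```

A caller could use the second return value to flag documents in which the landmark was never found, so that they are routed to the user for manual review rather than halting the program.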

While this function handles the detection and extraction of the identifying
information for the front page of an MCO, it can be used as a template function to handle
other documents: field manuals, technical manuals, etc. Figure 48 illustrates the output of
this function once the program successfully detected and extracted the document’s
identifying information.


Input  filename  to  analyze: 
page1.txt

Document:  MARINE  CORPS  ORDER  8010.1E 
Date:  15  Apr  97 

Subj:  CLASS  V(W)  PLANNING  FACTORS  FOR  FLEET  MARINE  FORCE  COMBAT 
OPERATIONS 


Figure  48.  Detect  and  Extract  Document  Info  Output 

c. Detect and Extract Table Information


This part of the program is responsible for detecting tables in the documents and
extracting their contents. Figure 49 illustrates the coding of this part of the program
(v1.0), which tests for the presence of the “Infantry-Heavy Threat Combat Planning
Factors Table.” In the example given, the function handleDataStructures(rawData) is
called by the program after the file has been read and the document’s identifying
information has been recorded. By studying the input document, we know that the string
“Infantry-Heavy Threat Combat Planning Factors Table” represents a table inside the
document that contains lines of consumption data. In order to begin data extraction from
the table, a logical test is conducted first to see if the table is present in the document: “if
"Infantry-Heavy Threat Combat Planning Factors Table" in rawData[counter]” is tested
on each line from the input document. After the program detects the table, it reads lines
of data into an intermediate data structure, “tableData,” until the next occurrence of the
word “Table” is encountered. This had to be done for the same reason that the
occurrence of “Ref:” was tested: due to the current format of the input document, no
distinguishable landmarks existed to tell the program where to stop. The only way to get
the program to stop was by testing for the presence of a new table. Again, should the word
“Table” not exist or be spelled incorrectly, the program would not find the end of the
table, requiring user intervention.
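
The stopping rule described above can be sketched with an explicit bound on the input length. This is an illustrative sketch rather than the thesis code; read_table_body, header_lines, and stop_word are hypothetical names, and header_lines counts the lines to skip after the title line:

```python
def read_table_body(lines, title, header_lines=0, stop_word="Table"):
    """Collect the lines that follow `title` until `stop_word` or end of input."""
    body = []
    for start, line in enumerate(lines):
        if title in line:
            # Skip the title line itself plus any column-header lines.
            index = start + 1 + header_lines
            # Stop at the next table title, or when the input runs out.
            while index < len(lines) and stop_word not in lines[index]:
                body.append(lines[index])
                index += 1
            break
    return body
```

If the stop word is missing, this sketch returns everything up to the end of the input, which still requires user review but does not depend on a landmark being present.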

By  reading  the  table  contents  into  the  secondary  data  structure  “tableData”  we  can 
further  isolate  the  inputs  for  extraction.  Another  counter,  “tmpCounter,”  is  used  to 
prevent  the  main  program  counter,  which  controls  position,  from  being  adjusted,  saving 
the  correct  position  in  the  main  program  until  the  secondary  extraction  process  has 
completed.  Further  leveraging  known  information  from  the  input  document,  we  know 
that  the  DODIC  has  a  length  of  five  characters.  Thus,  the  logical  test 
“if(len(tableData[tmpCounter])  ==  5)”  is  used  to  indicate  when  a  new  line  of  data  is 
encountered.  Once  a  new  line  of  data  is  encountered,  the  previous  line  of  data, 
“dataLine,”  is  printed  to  the  screen.  Note:  some  of  the  logic  tests  created  for  this  program 
may  cause  false  positives.  For  example,  if  a  consumption  data  element  is  five  characters 
and  not  a  DODIC,  the  program  would  attempt  to  extract  the  data  based  on  a  false 
positive. However, using these tests became necessary because no other distinguishing
tests could be created based on the format of the input document. Thus, in order to reduce
complexity of the code and minimize false positives, new standards for the input
documents may be needed. Furthermore, while the logic test checks for the presence of
the “Infantry-Heavy Threat Combat Planning Factors Table,” testing can be done on
other inputs. For example, placing “Infantry-Heavy Threat Combat Planning Factors
Table,” “Armor-Heavy Threat Combat Planning Factors Table,” and “Composite
Combat Planning Factors Table” in a list data structure would allow the program to test
repeatedly for table existence. By using this approach, a “consumption data dictionary”
that contains known occurrences of table names could be created. Over time, this
dictionary could track all known occurrences of tables and provide template-based
formatting for their extraction. For example, if the program were to recognize and detect
the presence of an “ammunition table” and have a pre-defined understanding that this
table was 10 lines of data, the program could locate the table and extract 10 lines of data.
This would help alleviate some of the hard-coded complexities in previous examples.
Figure 50 illustrates the output of this function (v1.0).
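
The list-of-known-tables idea and the five-character row test can be sketched together. This is a minimal illustration under stated assumptions, not the thesis implementation: KNOWN_TABLES, find_known_table, and group_rows are hypothetical names, and unlike the v1.0 logic this sketch also emits the final row:

```python
# Table names taken from the text above; the list itself is an assumption.
KNOWN_TABLES = [
    "Infantry-Heavy Threat Combat Planning Factors Table",
    "Armor-Heavy Threat Combat Planning Factors Table",
    "Composite Combat Planning Factors Table",
]


def find_known_table(line, known_tables=KNOWN_TABLES):
    """Return the first known table name contained in `line`, else None."""
    for name in known_tables:
        if name in line:
            return name
    return None


def group_rows(cells, key_length=5):
    """Group a flat list of cells into rows.

    A cell that is exactly `key_length` characters long is treated as the
    start of a new row, so any other five-character cell is a false positive.
    """
    rows = []
    current = []
    for cell in cells:
        if len(cell) == key_length and current:
            rows.append(" ".join(current))
            current = []
        current.append(cell)
    if current:
        rows.append(" ".join(current))
    return rows
```

Keeping the table names in one list means that supporting a new table is a data change rather than a code change, which is the motivation for the dictionary described above.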

def handleDataStructures(rawData):
    global rawDataSize
    counter = 0
    tmpCounter = 0
    tableData = []
    tableDataSize = 0
    dataLine = ""

    while (counter < rawDataSize):
        if "Infantry-Heavy Threat Combat Planning Factors Table" in rawData[counter]:
            counter = counter + 18      # Move past table header and column headers
            while "Table" not in rawData[counter]:
                tableData.append(rawData[counter])
                tableDataSize = tableDataSize + 1
                counter = counter + 1
                if (counter == rawDataSize):    # There are no more inputs, prevent
                    break                       # reaching out of bounds

            while (tmpCounter < tableDataSize):     # Construct each line of data
                if (dataLine == ""):
                    dataLine = tableData[tmpCounter]
                    tmpCounter = tmpCounter + 1

                while (True):
                    if (tmpCounter >= tableDataSize):
                        break
                    dataLine = dataLine + " " + tableData[tmpCounter]
                    tmpCounter = tmpCounter + 1
                    if (tmpCounter >= tableDataSize):
                        break
                    if (len(tableData[tmpCounter]) == 5):
                        print(dataLine)
                        dataLine = ""
                        tmpCounter = tmpCounter - 1
                        break
                tmpCounter = tmpCounter + 1
        counter = counter + 1


Figure 49. Automated Program: Handle Data Structures (v1.0)
Input filename to analyze:

page11.txt

Found  Table: 

Infantry-Heavy  Threat  Combat  Planning  Factors  Table 
Known  column  headers  for  this  table: 

Weapon  Ammunition  GCE  RATES  Other  than  GCE  Rates 

Weapon  ID  Nomenclature  DODIC  Nomenclature  Daily  Daily  Basic  Daily  Daily  Basic 

ASSAULT  SUSTAIN  Allowance  ASSAULT  SUSTAIN  Allowance 
B0471  SQUAD  DEMOLITION  SET  M032  CHARGE,  DEMO  BLOCK  1  LB  TNT  15.00753  3.39605  48  15.00753 
B0471  SQUAD  DEMOLITION  SET  M130  CAP,  BLASTING  ELECTRIC  18.01945  4.00000  150  18.01945  2.03084 
B0471  SQUAD  DEMOLITION  SET  M131  CAP,  BLASTING  NON-ELECTRIC  60.00000  36.00000  260  60.00000 
B0471  SQUAD  DEMOLITION  SET  M456  CORD,  DETONATING  PETN  1401.30498  340.45345  1500 
B0471  SQUAD  DEMOLITION  SET  M670  FUZE,  BLASTING  TIME  500.00000  216.00000  3000  380.00000 
B0471  SQUAD  DEMOLITION  SET  M757  CHARGE,  ASSEMBLY  DEMOLITION  15.00753  3.39605  50  15.00753 


Figure 50. Handle Data Structures Output (v1.0)


To illustrate how new policy standards can help reduce the amount of coding and
overall complexity of the program, two landmarks were inserted before and after the
table. The phrases “Begin Table” and “End Table” were placed in the input document as
wrappers around the table. The program was then modified to detect these phrases and
conduct data extraction. These changes are illustrated in Figure 51.


def handleDataStructures(rawData):
    global rawDataSize
    counter = 0
    tableData = []
    tableDataSize = 0
    while (counter < rawDataSize):
        if "Begin Table" in rawData[counter]:
            while "End Table" not in rawData[counter]:
                tableData.append(rawData[counter])
                tableDataSize = tableDataSize + 1
                counter = counter + 1
        counter = counter + 1  # Advance to the next input line

Figure 51. Automated Program: Handle Data Structures (v2.0)


By  placing  these  phrases  into  the  input  file,  we  were  able  to  drastically  reduce  the 
complexity  of  the  program.  Rather  than  creating  programs  that  must  handle  very  specific 
tests  such  as  detecting  the  string  “Infantry-Heavy  Threat  Combat  Planning  Factors 
Table,” refinement of the input allows more general usage, regardless of table name and
input  document  type. 
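To show how the landmark approach generalizes to any number of tables, the following is a minimal sketch that collects every block of lines wrapped in “Begin Table” and “End Table.” The function name extract_wrapped_tables and the sample lines are hypothetical. As a design choice of this sketch only, a table that is opened but never closed is discarded rather than returned.

```python
def extract_wrapped_tables(raw_data):
    """Return a list of tables; each table is the list of lines found
    between a "Begin Table" line and the next "End Table" line."""
    tables = []
    current = None  # None means we are outside any table
    for line in raw_data:
        if "Begin Table" in line:
            current = []
        elif "End Table" in line:
            if current is not None:
                tables.append(current)
            current = None
        elif current is not None:
            current.append(line)
    return tables


sample = [
    "MCO 8010.1E",
    "Begin Table",
    "B0471 M032 15.00753",
    "B0471 M130 18.01945",
    "End Table",
    "Some narrative text",
    "Begin Table",
    "B0471 M456 1401.30498",
    "End Table",
]
tables = extract_wrapped_tables(sample)
```

Since the loop visits each input line exactly once, it cannot run past the end of the input even when a landmark is missing or misspelled.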

d.  File  Output 

This  part  of  the  program  focuses  on  placing  the  consumption  outputs  into  a  file. 
While  this  program  focuses  on  output  to  a  file,  the  output  produced  can  be  interpreted 
as a (key, value) pair. This pair would take the following form: a string key
consisting of the document’s identifying information and a list value that would consist
of each line of consumption data. Thus, it would be (string identifyingInformation, list
consumptionDataLines). Although this program concentrates on writing outputs to a file
at  the  end  of  the  document’s  processing,  the  outputs  could  be  written  to  a  file  as  the 
program  works  through  each  consumption  table.  For  example,  when  a  table  is 
encountered,  the  entire  table  is  written  to  the  file,  the  data  structure  that  holds  the  table 
information  is  then  cleared,  and  the  process  is  free  to  move  to  the  next  table,  allowing 
repetitive  use  of  the  intermediate  data  structure.  Figure  52  illustrates  the  coding  of  this 
function. 


def writeToFile(fileName, table):

    outputFile = open(fileName, 'w')

    for element in table:
        outputFile.write(element + "\n")

    outputFile.close()  # Flush the contents and release the file

writeToFile("outputs.txt", table)


Figure  52.  Automated  Program:  File  Output 

While  this  function  produces  no  visual  output,  it  places  all  of  the  contents  in  the 
provided  table  into  the  provided  filename.  The  contents  of  this  table  would  consist  of  all 
the  consumption  elements  extracted  from  the  input  document. 
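The alternative described above, in which each table is written as soon as it is encountered and the intermediate data structure is then cleared, could look like the following sketch. The function names write_table and write_tables_incrementally, the sample key “MCO 8010.1E,” and the sample lines are assumptions made for illustration. The layout of the key and value within the file is likewise only one possible choice.

```python
import os
import tempfile


def write_table(output_file, identifying_information, consumption_data_lines):
    """Append one (key, value) pair to an already-open text file."""
    output_file.write(identifying_information + "\n")
    for line in consumption_data_lines:
        output_file.write(line + "\n")


def write_tables_incrementally(file_name, identifying_information, tables):
    """Write each table as soon as it is available, reusing one buffer."""
    buffer = []
    with open(file_name, "w") as output_file:  # closed automatically
        for table in tables:
            buffer.extend(table)
            write_table(output_file, identifying_information, buffer)
            buffer.clear()  # free the intermediate structure for reuse


# Usage with a temporary directory so the working directory is untouched.
path = os.path.join(tempfile.mkdtemp(), "outputs.txt")
write_tables_incrementally(path, "MCO 8010.1E", [["line A", "line B"], ["line C"]])
with open(path) as handle:
    written = handle.read().splitlines()
```

Writing incrementally keeps memory use bounded by the size of the largest table rather than by the size of the whole document.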




e.  Phase  Two  Test 

In  order  to  test  the  automated  program  for  phase  two,  all  of  the  input  files  were 
consolidated  into  one  file  named  “consolidatedinputs.txt.”  The  program  was  slightly 
modified  to  search  for  all  occurrences  of  the  tables  inside  of  the  consolidated  input  file. 
Again, false positives caused by the format of the input document forced the creation of
otherwise unnecessary tests. The exact string “Infantry-Heavy Threat Combat
Planning  Factors”  was  found  on  the  first  page  of  the  input  document  in  the  enclosure 
section.  The  program  began  to  extract  data  from  this  point,  all  of  which  was  incorrect.  “If 
((counter  >  90)  and  (“Infantry-Heavy  Threat”  in  rawData[counter]))”  represents  the  logic 
test  that  had  to  be  created  in  order  to  access  the  table  at  the  correct  position.  This  could 
have  been  avoided  by  giving  the  program  only  the  sections  of  the  document  that 
contained  the  tables.  However,  this  requires  more  user  interaction.  Additionally,  the 
problem  of  determining  when  a  table  had  ended  presented  itself  again.  In  order  to  stop 
extracting  data  for  a  particular  table,  the  program  had  to  check  for  the  presence  of  the 
next  table.  Using  the  “Begin  Table”  and  “End  Table”  changes,  as  suggested  earlier, 
would have prevented us from having to create these unnecessary tests. Should these
tables not exist or their names be misspelled, the inner loops would never meet their
stopping condition, and the program would read past the end of the input and fail.
Figure  53  illustrates  a  program  that  was  capable  of  extracting  consumption  data  elements 
from  all  the  tables  in  the  consolidated  input  file. 
table1 = []
table2 = []
table3 = []

while (counter < rawDataSize):

    if ((counter > 90) and ("Infantry-Heavy Threat" in rawData[counter])):
        table1.append(rawData[counter])
        counter = counter + 1

        while "COMBAT PLANNING FACTORS FOR SPECIAL OPERATIONS" not in rawData[counter]:
            table1.append(rawData[counter])
            counter = counter + 1

    if "COMBAT PLANNING FACTORS FOR SPECIAL OPERATIONS" in rawData[counter]:
        table2.append(rawData[counter])
        counter = counter + 1

        while "ARTILLERY ANCILLARY ITEMS" not in rawData[counter]:
            table2.append(rawData[counter])
            counter = counter + 1

    if "ARTILLERY ANCILLARY ITEMS" in rawData[counter]:
        table3.append(rawData[counter])
        counter = counter + 1

        while len(rawData[counter]) > 0:
            table3.append(rawData[counter])
            counter = counter + 1

            if counter >= rawDataSize:
                break

    counter = counter + 1


Figure  53.  Automated  Program:  Handle  Consolidated  Input  File 
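A more defensive variant of the segmentation in Figure 53 is sketched below. It is not the program used in this thesis: the names TABLE_HEADERS and split_by_headers are hypothetical, and the sketch assumes that a table ends at the next known header or at the end of the input (the blank-line stop used for the last table in Figure 53 is omitted). When a header string appears more than once, the last occurrence replaces earlier ones, which in this sample also discards the false positive from the enclosure list.

```python
# Header strings taken from Figure 53.
TABLE_HEADERS = [
    "Infantry-Heavy Threat",
    "COMBAT PLANNING FACTORS FOR SPECIAL OPERATIONS",
    "ARTILLERY ANCILLARY ITEMS",
]


def split_by_headers(raw_data, headers=TABLE_HEADERS, start=0):
    """Return {header: [lines]}; a table ends at the next known header
    or at the end of the input, whichever comes first."""
    tables = {}
    current = None  # header of the table currently being collected
    for line in raw_data[start:]:
        matched = next((h for h in headers if h in line), None)
        if matched is not None:
            current = matched
            tables[current] = [line]  # a later occurrence replaces an earlier one
        elif current is not None:
            tables[current].append(line)
    return tables


sample = [
    "Encl: Infantry-Heavy Threat Combat Planning Factors",
    "intro text",
    "Infantry-Heavy Threat Combat Planning Factors Table",
    "row 1",
    "ARTILLERY ANCILLARY ITEMS",
    "row 2",
]
result = split_by_headers(sample)
```

A missing or misspelled header simply produces no entry for that table, and the text that would have belonged to it is attached to the preceding table, so the failure is visible in the output instead of stopping the program.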


f. Automated Program Summary

The automated program takes only one user input to begin processing: a
filename. Once the filename has been entered, the program executes automated analysis
of the file using pre-built logic tests. Once the program has finished its analysis, the user
must verify each data element before it is written to the output file. This part of the
program is not illustrated because it is very similar to the program presented in the next
section.
While the goal of this program is to achieve automation, it requires extensive
testing and logic creation, and it still requires user interaction to review the results once
the program has autonomously extracted all the possible consumption data elements.
Some of the logic tests used by this program are avoidable and may
ultimately cause the program to fail. For example, the test “If ((counter > 90) and
(“Infantry-Heavy Threat” in rawData[counter]))” would be unnecessary if the user
selected the input data or if the input document were refined. While the
goal of creating these tests was to increase the accuracy of the output, they increase the
complexity and length of the code, require additional processing power, and may cause
the program to run more slowly or less efficiently. Additionally, they may only work
with very specific inputs. In order to mitigate this problem, refinement of the input
document is necessary. By creating unique identifiers such as “Begin Table” and “End
Table,” the complexity and length of the program can be reduced while also allowing it to
accept a larger variety of inputs.

2.  Walkthrough  Programs 

While the automated program makes use of several functions, this program only
uses one. Instead of extracting the document’s identifying information and consumption
elements separately, this program allows the user to step through each line of the input
document in sequential order, prompting the user to keep the line or disregard it. Thus,
identifying information and consumption elements no longer need to be processed
separately, since both come through as lines of input. Using this approach,
the input file can be analyzed line-by-line or in another increment: paragraph-by-paragraph,
page-by-page, etc. Two approaches using this application are presented: line-by-line
and page-by-page. As a reference point, a page is defined as 45 lines of input. The
decision to use 45 lines is based on the standard MS Word® document format: a one-page
document with one-inch margins can hold 45 lines of information in Times New
Roman, 12-point font.
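The 45-line page definition can be expressed as a small helper that splits the input into pages before any review takes place. This is a sketch only; the names LINES_PER_PAGE and paginate are hypothetical and do not appear in the programs presented in this thesis.

```python
LINES_PER_PAGE = 45  # page size defined in the text


def paginate(raw_data, lines_per_page=LINES_PER_PAGE):
    """Split a list of lines into consecutive pages."""
    return [raw_data[i:i + lines_per_page]
            for i in range(0, len(raw_data), lines_per_page)]


# 100 sample lines produce two full pages and one partial page.
pages = paginate(["line %d" % n for n in range(1, 101)])
```

Making the page size a parameter allows other increments, such as a paragraph-sized block, to reuse the same helper.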
a. Line-by-line Program

First,  the  input  file  is  opened  and  each  line  of  the  document  is  read  into  a 
temporary  data  structure.  Afterwards,  the  user  is  presented  with  one  line  of  information  at 
a  time  and  is  prompted  to  verify  whether  or  not  it  is  a  data  element.  If,  and  only  if,  the 
user  enters  “yes,”  the  element  gets  placed  into  the  final  output  data  structure.  Once  the 
document  has  been  reviewed  and  no  inputs  remain,  the  data  structure  that  contains  all  the 
“yes”  responses  is  then  placed  into  an  output  file.  Figure  54  illustrates  the  coding  of  the 
line-by-line  program.  Figure  55  illustrates  a  snapshot  of  the  running  application. 


validInformation = []  # Store the final data elements

def manualWalkthrough(rawDataList):
    for element in rawDataList:
        print ("Is this valid information?")
        print (element)
        response = input()
        if (response == "yes"):
            validInformation.append(element)  # If yes, add it to the list

manualWalkthrough(rawData)  # Call the function

Figure  54.  Line-by-line  Program  (coding) 


Is  this  valid  information? 

DEPARTMENT  OF  THE  NAVY 
yes 

Is  this  valid  information? 

HEADQUARTERS  UNITED  STATES  MARINE  CORPS 
yes 

Is  this  valid  information? 

WASHINGTON,  DC  20380-0001 
no 

Is  this  valid  information? 

MCO  8010.1E 


Figure  55.  Line-by-line  Program  (running) 

The  main  strength  of  this  approach  is  its  simplicity.  The  main  weakness  of  this 
approach  is  its  lack  of  functionality.  As  long  as  there  is  only  one  valid  consumption 
element per line in the input document, an end-user can quickly and accurately walk
through the document. However, if the consumption data element is split apart and spread
across multiple lines, the program output may not make sense. In order to correct this, the
input document must be further processed or a string concatenation procedure must be
created. While the program can quickly walk through each line, the end user may find it
faster to view multiple lines of information at once with the ability to select specific lines
or ranges. Thus, the page-by-page program offers a potential performance increase.
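One possible string concatenation procedure is sketched below. It rests on an assumption drawn from the sample data in Figure 50, namely that every record begins with a five-character identifier such as “B0471” (one capital letter followed by four digits); documents that do not follow this pattern would need a different rule. The names RECORD_START and join_split_records are hypothetical.

```python
import re

# Assumption: a record starts with a five-character ID such as "B0471"
# (one capital letter followed by four digits).
RECORD_START = re.compile(r"^[A-Z]\d{4}\b")


def join_split_records(lines):
    """Merge continuation lines into the record that precedes them."""
    records = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        if RECORD_START.match(text) or not records:
            records.append(text)  # start a new record
        else:
            records[-1] = records[-1] + " " + text  # continuation line
    return records


fragments = ["B0471", "SQUAD DEMOLITION SET", "M032 15.00753", "", "B0471 SQUAD", "M130"]
joined = join_split_records(fragments)
```

Running this step before the line-by-line review would present the user with one complete record per prompt instead of isolated fragments.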

b.  Page-by-page  Program 

This program is similar to the line-by-line program but aims to speed up the
review process. Figure 56 illustrates the coding of this program.


def manualWalkthrough(rawDataList):
    counter = 1
    tmpCounter = 0
    while (True):

        print ("Please select the lines of valid information by line #:")
        print ("Separate line #'s with a space - e.g. 1 3 15 31 44")

        pageStart = tmpCounter  # Index of the first line shown on this page
        while ((counter < 46) and (tmpCounter < rawDataSize)):
            print(str(counter) + "." + " " + rawDataList[tmpCounter])
            tmpCounter = tmpCounter + 1
            counter = counter + 1

        response = input()

        responses = response.split(" ")  # Split the inputs by their space
        for element in responses:
            # Map the displayed line number back to its index in the input
            validInformation.append(rawDataList[pageStart + int(element) - 1])
        counter = 1

        if (tmpCounter == rawDataSize):
            print ("Review complete.")
            break

manualWalkthrough(rawData)  # Call the function


Figure 56. Page-by-page Program (coding)
The main strength of this approach is its speed. The main weakness is its need for
structured  and  accurate  input  by  the  user.  The  program  begins  by  prompting  the  user  to 
enter  specific  line  numbers  that  are  deemed  to  be  elements  or  consumption  data.  Before 
the  user  is  able  to  enter  input,  45  lines  of  data  are  presented  to  the  user  for  review.  The 
user  must  enter  structured  input  (line  numbers  separated  by  white  space)  that  corresponds 
to  each  correct  line.  Once  the  user  input  has  been  tied  to  corresponding  lines,  the 
respective  lines  are  transferred  to  the  final  output  data  structure.  Once  all  of  the  lines  have 
been  reviewed,  the  program  terminates.  An  extension  of  the  program  would  be  the  ability 
to  go  back  through  the  final  output  list  and  modify  and  review  the  values  as  a  second 
layer  of  precaution.  While  the  goal  is  to  speed  up  the  process,  it  places  more 
responsibilities  and  requirements  on  the  end  user.  Unless  failure  logic  is  added, 
inappropriate  responses  or  errors  in  the  input  will  cause  the  program  to  fail  or  raise 
exceptions.  Figure  57  illustrates  a  running  instance  of  the  program. 


Please  select  the  lines  of  valid  information  by  line  #: 

Separate  line  #'s  with  a  space  -  e.g.  1  3  15  31 44 

1.  DISTRIBUTION  STATEMENT  A:  Approved  for  public  release;  distribution 

2.  is  unlimited. 

3.  MCO  8010.1E 
4. 15  Apr  97 

5.  can  be  used  in  conjunction  with  MAGTF  II  to  determine  Class  V(W) 
requirements 

6.  for  operational  plans. 

1 2 3 4

Output: 

1.  DISTRIBUTION  STATEMENT  A:  Approved  for  public  release;  distribution 

2.  is  unlimited. 

3.  MCO  8010.1E 
4. 15  Apr  97 


Figure  57.  Page-by-page  Program  (running) 
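The failure logic mentioned above, together with the ability to select ranges of lines, could be sketched as follows. The function name parse_selection and the “5-7” range notation are assumptions made for illustration and are not part of the program in Figure 56.

```python
def parse_selection(response, page_length):
    """Return sorted zero-based indexes for input such as "1 3 5-7".
    Raises ValueError with a readable message on malformed input."""
    selected = set()
    for token in response.split():
        first, dash, last = token.partition("-")
        if not first.isdigit() or (dash and not last.isdigit()):
            raise ValueError("Not a line number or range: " + token)
        low = int(first)
        high = int(last) if dash else low
        if low < 1 or high > page_length or low > high:
            raise ValueError("Out of range for this page: " + token)
        selected.update(range(low - 1, high))
    return sorted(selected)


indexes = parse_selection("1 3 5-7", 45)
```

A calling program could catch the ValueError, display its message, and prompt the user again instead of terminating, which addresses the reliance on perfectly structured input.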


3.  Program  Summary 

The first program concentrates on automated analysis with user interaction at the
end of the automated data processing. While this program strives to achieve automation,
additional consumption document analysis and logic test creation are required. The
creation of complex logic tests can be mitigated by further refinement of the input
document. The second program concentrates on walkthrough analysis using two
approaches: line-by-line and page-by-page. While the line-by-line program offers
simplicity in its current state, it lacks functionality and may be slower than the second
approach. The page-by-page approach allows 45 lines of information to be reviewed at a
time, offering speed at the cost of simplicity and with added reliance on user correctness.
Since consumption data may be “buried” at the end of a consumption
document, this approach allows the end-user to quickly reach their desired position in the
document. Regardless of which program is used, the emphasis remains on correcting the
input document and ensuring that it is free of errors and is in the most suitable format. This
can be accomplished through multiple layers of internal review and the establishment of
new standards that would govern all future consumption documents.
V. CONCLUSIONS


A.  LESSONS  LEARNED 

1.  Choosing  OCR  Application 

Chapter  II  discussed  two  OCR  approaches:  free-form  and  template-based 
recognition.  Due  to  the  nature  of  the  consumption  documents  reviewed  by  this  thesis, 
free-form  analysis  was  chosen  over  template-based.  Although  template-based  OCR  can 
be  faster  than  free-form,  no  feasible  template  opportunities  presented  themselves. 
Additionally,  template-based  OCR  requires  a  new  template  be  created  for  every  instance 
of  a  table.  Thus,  free-form  OCR  is  the  most-preferred  method  for  the  current  state  of 
consumption  documents.  Should  consumption  data  be  presented  in  a  template  format  in 
future  standards  or  documents,  template-based  OCR  may  present  itself  as  an  opportunity. 

Chapter  IV  compared  three  OCR  applications:  an  open-source,  online  application, 
Microsoft  OneNote®,  and  Nuance  OmniPage®.  Of  the  three  applications,  the 
OmniPage®  software  offered  the  most-reliable  functionality  and  the  highest  accuracy 
rate  and  is  therefore  recommended  over  the  other  applications. 

OmniPage®  allowed  for  a  simultaneous  conversion  and  correction  process, 
removing  the  requirement  to  separately  correct  the  input  document  in  another  text  editor 
after  it  had  been  converted.  The  open-source,  online  application  was  more  accurate  and 
more robust than OneNote®. Although the open-source application claims to be free, the
number of pages that can undergo OCR is limited unless additional pages are purchased.
Thus,  while  it  may  not  be  feasible  to  use  the  online  application,  the  OCR  conversion 
process  could  be  out-sourced  at  first  until  native  OCR  capability  is  acquired  or  created. 
OneNote®  proved  to  be  unreliable,  creating  numerous  spelling  errors  while  not  providing 
intelligent  functionality  for  creating  tables  or  data  structures. 

2.  Choosing  Application  Approach 

Chapter  IV  compared  two  programs  created  to  extract  consumption  information 
out  of  input  documents:  automated  and  walkthrough  analysis.  The  walkthrough  analysis 
program was further subdivided into two approaches: line-by-line and page-by-page.
While  the  automated  program  was  able  to  autonomously  extract  consumption  elements 
out  of  the  input  document,  highly-complicated  and  restrictive  parsing  logic  had  to  be 
created.  Creation  of  this  logic  requires  that  the  programmer  understand  and  be  familiar 
with  the  nature  of  the  input  document.  Additionally,  a  programmer  would  need  to  see  an 
example  of  every  consumption  document  to  ensure  that  they  had  correctly  defined  all  the 
logic  statements.  Furthermore,  creation  of  these  restrictive  tests  should  be  minimized  to 
ensure  the  application  could  be  used  in  a  wide  range  of  environments.  While  the 
template-based nature of a consumption document such as an MCO allows the program to
accurately extract the document’s identifying information, it can also create problems that
must  be  addressed.  For  example,  consumption  tables  are  commonly  referenced  as 
enclosures.  Thus,  logic  had  to  be  created  to  look  past  these  occurrences.  Additionally, 
user  interaction  must  occur  at  the  end  of  the  program  to  verify  that  the  program  correctly 
interpreted  and  extracted  all  the  possible  data  elements. 

Since  the  current  state  of  consumption  documents  may  require  extensive  logic  test 
creation,  the  goal  of  the  walkthrough  analysis  program  was  to  circumvent  this 
requirement. By allowing the user to walk through the input document on a line-by-line or
page-by-page basis, the responsibility of verifying consumption data is placed on the
end-user instead of the program. While both approaches allow the end-user to walk through
the  input  document,  the  page-by-page  program  is  preferred  since  it  allows  for  rapid 
movement,  selection,  and  verification. 

Regardless  of  which  program  is  used,  the  need  to  refine  and  format  the  input 
document  remained  a  central  focus  throughout  the  process.  First,  the  input  document 
should  be  converted  to  an  appropriate  input  format  (.docx,  .txt,  .PDF)  for  the  extraction 
program.  In  this  thesis,  a  text  file  with  a  .txt  extension  was  used  to  keep  the  input  very 
simple  and  to  allow  the  native  Python  libraries  to  open  the  files.  Once  the  document  has 
been  placed  into  the  appropriate  format,  it  should  be  reviewed  and  reformatted  (spelling, 
table  creation,  etc.)  in  order  to  make  the  document  easier  to  read  for  the  end-user  of  the 
program.  While  many  of  these  steps  may  be  required  due  to  the  current  state  of 
consumption  documents  or  the  system  as  a  whole,  future  standards  and  policies  can  be 
dictated and followed to reduce the steps required in this process. Although the program
focuses on output to a file, it can be altered slightly to output a (key, value) pair. The
decision  as  to  what  output  to  use  is  based  on  the  storage  approach  used  by  the  end-user. 
This  part  of  the  problem  was  left  for  future  analysis  and  research. 

B.  RECOMMENDATIONS 

This  thesis  recommends  the  following: 

•  Establish  a  baseline  listing  and  collection  of  consumption  documents  in 
one  central  location.  In  order  to  create  a  reliable  automated  application, 
the  program  should  be  aware  of  all  known  iterations  of  a  consumption 
document. 

• Convert the input documents to the desired input format using dedicated OCR
software or an outsourced OCR service. Although
professional-grade software such as OmniPage® is not free, it offers the
highest accuracy and most functionality at a relatively low cost.

•  Refine  and  utilize  a  page-by-page  walkthrough  analysis  program  to  extract 
and  upload  the  first  iteration  of  consumption  data  elements,  using  the 
baseline  listing  as  a  checklist. 

•  Refine  standard  policies  and  correspondence  procedures  for  the 
representation  of  consumption  data  in  future  documents. 

•  Once  the  baseline  has  been  established  and  refined  policies  have  been 
created,  refine  and  utilize  an  automated  analysis  program. 

Due  to  the  current  format  state  of  consumption  documents  and  the  lack  of  a 
centrally-located baseline listing, implementation of an automated program would be
ineffective.  Thus,  the  baseline  should  be  created  using  a  walkthrough  program  and  then 
gradually  migrated  to  a  point  where  an  automated  program  can  produce  accurate  results 
without  unnecessary  coding. 

C.  FUTURE  RESEARCH 

The  following  areas  represent  opportunities  for  future  research  and  development: 

•  Although  commercial-off-the-shelf  OCR  technology  was  compared,  a 
native  OCR  application  could  be  researched  and  created.  Conducting 
research  in  this  field  would  require  extensive  background  in  computer 
vision,  text  analysis,  and  application  coding  and  may  require  more  than 
one thesis to fully develop the application.
• While a “bare-bones” walkthrough analysis program was given, the
creation  and  refinement  of  a  walkthrough  application  might  be 
accomplished  in  the  scope  of  a  single  thesis. 

•  Further  advancement  and  creation  of  an  automated  program  would  be  best 
suited  for  a  single  thesis.  Should  this  avenue  be  pursued,  it  is 
recommended that the researcher have a background understanding of, and
access to, Marine Corps logistics documents and be proficient in
application coding.

•  While  the  programs  that  are  presented  make  use  of  safe  coding  practices, 
they  do  not  focus  on  security  vulnerabilities  that  may  or  may  not  be 
present.  Thus,  vulnerability  testing  could  be  conducted. 

•  This  thesis  addresses  how  inputs  can  be  placed  into  a  database.  It  does  not 
illustrate  how  this  data  can  best  be  presented  to  the  end  user.  Thus,  data 
access and representation can be researched from a Human-Computer
Interaction (HCI) standpoint.

D.  SUMMARY 

In  conclusion,  this  thesis  has  strived  to  provide  the  best  picture  for  the  way 
forward  by  conducting  background  research  into  OCR  and  presenting  multiple 
approaches  for  tackling  the  analysis  phase.  The  purpose  of  the  examples  given,  problems 
identified, and recommendations presented is to support the Marine Corps’ advance
towards  automation  of  logistics  consumption  data  and  associated  planning,  allowing  its 
focus  to  remain  on  winning  the  nation’s  wars. 